Trade-offs in Distributed Learning
Formal Metadata
Title: Trade-offs in Distributed Learning
Part Number: 1
Number of Parts: 10
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/20847 (DOI)
Transcript: English(auto-generated)
00:17
Thanks. I actually also added the word optimization because the talk ended up being
00:22
a lot focused on the intersection of learning and optimization. So anyway, thanks for the introduction and thanks for inviting me. So this talk is actually based on several works, some of them together with my student Yossi Arjevani, as well as works with Nati Srebro and Tong Zhang.
00:43
So when we talk about distributed learning and optimization, we're talking about a situation where we want to do some standard learning or optimization task but where the data is partitioned across several machines. And this can come in various ways. So it can be starting from having several cores on the same CPU through several nodes in
01:07
some cluster and up to some huge computing grid with different machines, maybe a different parts of the world. So distributed learning and optimization is something that has become very
01:23
popular in recent years. And there are two main reasons why we want to consider this setting. So one of them is when we have lots of data. So as we all know, we live in the era of big data. And sometimes the data is so large that we cannot fit it in a single machine.
01:41
So we need to distribute the data across several machines and do the learning based on that. So this is viewing the distributed setting as a constraint. But another situation where it's actually distributed as an opportunity is when we don't have one machine, we have k machines. So
02:01
hopefully, we'll be able to solve the problem k times faster. And there are several challenges when we want to do learning optimization in distributed setting. The first one is the issue of communication. So no matter which of the scenarios we're talking about, whether it's
02:22
cores on the same CPU or geographically distributed computing grid, communication is always something which is much, much slower than local processing. It takes much more time to send data between machines compared to each machine doing something locally on its own data. Usually, this is in orders of magnitude difference. So generally, we want algorithms which are
02:45
distributed, which communicate, but actually communicate as little as possible because it's expensive. The second challenge is how do we parallelize the computation? And the challenge here is that many standard learning and optimization algorithms are very inherently
03:02
sequential in nature. So if we look, for instance, on stochastic gradient descent, this is based on very sequentially taking an example, doing some update, then another example, then another update. And a priori, it's not clear how we take such kind of an algorithm and make it work in parallel. The third challenge is that we want the result to be accurate. So
03:25
obviously, we don't want to suffer because we distributed the computation. So the output quality should resemble what we could get with a non-distributed algorithm. So there are many ways to formalize this setting. The setting that we're going to focus on
03:48
is when we want to do something akin to empirical risk minimization and when the function that we want to optimize is convex. So we basically want to minimize some function
04:02
f, which we can write as an average of fi's. Each of these fi's represents the average loss on one of the machines, and we want to minimize the average. And each machine in turn has n data points. So for simplicity, we're assuming each machine has the same number of examples, but everything I say can be easily generalized.
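In symbols (my notation; the slides may differ slightly), the setup is

\[
\min_{w} \; F(w) = \frac{1}{m}\sum_{i=1}^{m} F_i(w),
\qquad
F_i(w) = \frac{1}{n}\sum_{j=1}^{n} f_{i,j}(w),
\]

where m is the number of machines, machine i holds the n individual losses $f_{i,1},\dots,f_{i,n}$, and $F_i$ is the "fi" of the talk, the average loss on machine i.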
04:25
And the functions are generally convex. We can talk about situations similar to non-distributed optimization. We can divide this into scenarios where the functions are maybe strongly convex or smooth or both. And we'll discuss each of these scenarios later
04:42
on. And in terms of the communication, we generally assume that communication takes place in communication rounds where machines can broadcast to each other. And the amount of communication corresponds to sending order of d bits per machine per round where d is the dimension. So the idea is that the machines can send to each other maybe vectors or gradients,
05:08
but not things like Hessians, for instance, d by d matrices. And this corresponds to a big data high dimensional learning scenario where d can be very large and sending say d by d
05:22
matrices is not something which is feasible. And the main question in this setting is how can we sort of optimally trade off between these three requirements? How do we get an accurate solution with as little communication as possible and with as little runtime as possible,
05:42
ideally getting speed ups by parallelization? And notice that in this talk, when we're talking about accuracy, I'm going to focus on optimization error. So our goal is to minimize this empirical risk function. You can also talk about other goals. So for instance,
06:03
you can assume that the data is sampled IID from some distribution and your goal is to minimize the expected loss or the risk. And that sort of puts things in a slightly different perspective. Again, there are many ways and many settings one can talk about here, but this is the setting I will focus on. So I'm starting to make things a bit more concrete.
06:30
So in order to discuss the different results, we need to make various assumptions on how the data was distributed to the machines in the first place. What can we say about the relation between them? So one scenario is when we don't assume anything. So the data was
06:50
partitioned in some arbitrary way. Maybe one machine has the positive examples, another machine has the negative examples. This is one setting. At the other extreme, we may assume that the data
07:02
was actually partitioned at random. So we had a bunch of data points that were just assigned uniformly at random to the machines. And then the situation from the algorithm designer's perspective is potentially improved because now there are stronger relationships between the data across
07:23
machines. So for instance, we have various concentration of measure effects, the values or the gradients of the local functions of each machine are related. And as we'll see later, this is something that we can utilize. Another setting that is interesting to talk about, which in some sense generalizes the previous two, is a delta related setting where we assume
07:44
that there are relationships between the values or the gradients of the local functions at any point. So again, if the data is partitioned at random, you really have this kind of
08:02
situation where delta is pretty small. But you can also discuss here more general things. Maybe there are statistical similarities between the data points, but maybe it wasn't exactly partitioned at random. So in some sense, it lies between the arbitrary and random partition scenarios. So basically, what we're going to do in this talk is to discuss each of these
08:26
three scenarios and discuss both upper and lower bounds, mainly in terms of the amount of communication. And also discuss the runtime though. And regarding the random partition,
08:45
so the results there are actually going to rely on some very new results, which might be of independent interest actually, having to do with without-replacement sampling in stochastic gradient methods. So I'll talk about that, but also point out how it gives
09:02
something new for distributed learning with random partition. OK, so let's start with the arbitrary partition scenario. So we don't assume anything about what's the relationship between the functions of the different machines. And maybe the simplest baseline here that one can think of is just to reduce it to standard first order
09:26
non-distributed optimization. So we can ignore the fact that we are in a distributed scenario where this f is an average of functions of different machines. And we can just have each machine compute gradients of this capital F. This requires one communication round. So each
09:46
machine computes the gradient of the local function. Then they do a communication round to average. And that gives us basically an oracle to compute gradients of big F. And now we can plug it into any kind of black box first order algorithm. For instance,
10:03
gradient descent. And then you get an algorithm where the number of communication rounds is the same as the number of iterations of this algorithm. So you can do gradient descent. You can also do all of the other things that people do in standard optimization. You can do accelerated gradient descent. You can do smoothing, and so on. And that gives you
10:24
upper bounds on the number of communication rounds you would need. So if the local functions are strongly convex and smooth, you get that the number of communication rounds will scale like the square root of the condition number, 1 over lambda.
10:42
I can also talk about the case of just non-smooth but lambda-strongly convex functions, just convex functions, and so on. You just derive it from the standard upper bounds for these algorithms. And on one hand, this is a very nice and simple approach. It's also almost fully
11:03
parallelizable because most of the time, each machine just computes the gradient of its own local function. But it does require a relatively large number of communication rounds. So when we do large-scale learning, problems in high dimensions, the lambda usually comes from
11:23
an explicit regularization that we add to the problem. And for statistical learning considerations, usually it's quite small. It actually decays with the amount of data that you have. So lambda is generally quite small. And then you may need to do many communication rounds, or maybe also depending on
11:41
epsilon where epsilon is a desired accuracy. Now, so of course, yes? Just to be clear, so everything is fully synchronized, like your communication rounds is like one shot? Yes, it's a very simple, naive, simple baseline that you can always do. Now, of course, as you might suspect,
12:04
there are probably more sophisticated things you can do. And there has actually been a lot of work in recent years on algorithms for such a situation, from ADMM, CoCoA, CoCoA+, and many others. But at least in the worst case,
12:21
do they actually improve on this simple baseline? And the answer maybe surprisingly is no. So again, at least in the worst case, over say all strongly convex and smooth functions, you can't get something better in terms of the number of communication rounds, at least for a very, very large family of algorithms.
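Before getting to that family of algorithms, here is a minimal sketch of the simple baseline described above: one communication round buys one exact gradient of F, and any black-box first-order method (plain gradient descent here) runs on top of it. This is my own illustration, not code from the talk; the callables in machines and the step size are placeholder assumptions.

import numpy as np

def distributed_gradient_descent(machines, w0, step_size, num_rounds):
    # machines: list of callables, machines[i](w) returns the gradient of the
    # local function F_i at w (computed in parallel on machine i in practice).
    w = w0.copy()
    for _ in range(num_rounds):
        local_grads = [local_gradient(w) for local_gradient in machines]
        # One communication round: average the local gradients to get grad F(w).
        grad = np.mean(local_grads, axis=0)
        # Black-box first-order update; accelerated variants would slot in here.
        w = w - step_size * grad
    return w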
12:42
So basically, these are algorithms which fall into the following template. So each machine implicitly has some kind of set of vectors wj. And what the machine can do between communication rounds is sequentially compute vectors, which are
13:05
basically either linear combinations of the vectors it computed so far, or gradients of the local function at that point, or even things like multiplying the points with Hessians.
13:20
And actually, it doesn't even have to be that the point it computes is in the span of these things; it can also be a linear combination of the point and its gradient. So for instance, that allows us to also consider algorithms which solve some kind of local optimization problem at each iteration. And the machines can actually do this for as long as they want,
13:46
in terms of the lower bound that I'm going to show. And during communication round, they can basically share some of the vectors they have computed. So I don't know if it covers every possible imaginable algorithm for this problem, but it
14:04
does cover the kind of reasonable approaches that at least I can think of. Yes. What do you mean like gamma need to be positive? Yeah, this is just for technical reasons. So the point is that if the gamma is negative, it means that you may be able to solve at every round local optimization problems,
14:27
which are non-convex. And this is actually something that would break the lower bound, but it's also if you limit yourselves to algorithms which are based on convex optimization,
14:42
then you basically have these factors which are positive. So you need gamma and nu to have the same sign, that's it? Yes. Yeah. OK. And I'll show you the proof idea. It's actually very simple. I'm going to focus on the case where we just have two machines.
15:01
And the local function of each machine will be just a quadratic function. Only one of them will have a linear term, where e1 is the first standard unit vector. And A1 and A2 are two matrices which have the following form. So they are sort of block diagonal, where the blocks are overlapping.
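Schematically, and glossing over the exact scaling and regularization constants (this is my reconstruction of the slide, following the standard interlocking construction), the two local functions look something like

\[
f_1(w) = \tfrac{1}{2} w^{\top} A_1 w - e_1^{\top} w + \tfrac{\lambda}{2}\|w\|^2,
\qquad
f_2(w) = \tfrac{1}{2} w^{\top} A_2 w + \tfrac{\lambda}{2}\|w\|^2,
\]

where $A_1$ couples the coordinate pairs $(1,2), (3,4), \dots$ and $A_2$ couples the pairs $(2,3), (4,5), \dots$, for instance via $2 \times 2$ blocks $\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$ placed along the diagonal and shifted by one coordinate between the two machines. The explicit $\tfrac{\lambda}{2}\|w\|^2$ terms are my addition to make each local function lambda-strongly convex; the actual slide may fold this in differently.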
15:24
So what's the idea? So let's consider the first machine before any communication was done. So because of the bias term, it would be able to compute a vector with a non-zero in the first coordinate. But then before communicating, it won't be able to manufacture any vectors
15:43
with non-zero values except in the first coordinate. Once a communication round happens, the machine will be able to make the second coordinate, actually even the third coordinate non-zero. But again, it would get stuck,
16:01
again, because of the structure of these matrices. So basically, the number of communication rounds limit how many coordinates we can make non-zero in the vectors that the machines compute. But the optimum of this problem
16:24
actually requires all coordinates to be non-zero. And if only the first few coordinates are non-zero, that gives us a lower bound on the error. So after t iterations, the error can be no less than exponential in minus t over the square root of 1 over lambda.
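Restating the bound from the previous sentence in symbols (constants omitted):

\[
F(w_t) - \min_{w} F(w) \;\gtrsim\; \exp\!\left(-\,\frac{t}{\sqrt{1/\lambda}}\right),
\]

so reaching accuracy $\varepsilon$ requires on the order of $\sqrt{1/\lambda}\,\log(1/\varepsilon)$ communication rounds.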
16:45
And that basically gives the lower bound for the strongly convex and smooth case. As some of you might recognize, if you look at how the global function looks like, the average of f1 and f2, this is essentially the kind of hard function that's being used to prove lower bounds for first order algorithms in a non-distributed
17:07
optimization scenario. So the construction is the same, but here we actually make different structural assumptions. So again, as I said in the previous slides, the machines can compute the local Hessians and multiply them with things. That's fine.
17:24
And still, the lower bound would hold. You can also do similar things to get results for, say, non-smooth functions. The basic idea is still the same, but the construction is different. So without smoothness, we again create two functions with this kind of interlocking
17:43
structure. But now it's not a quadratic form. It's with absolute values. And we get 1 over lambda t squared lower bound, which is matched if you do the simple baseline I talked about specifically with accelerated gradient descent and with
18:02
proximal smoothing. Any questions about this? So next, we'll discuss the delta-related setting, which, as I said, is a situation where we assume, again, I'm not going to assume that the data was necessarily partitioned
18:23
at random, but still, I will assume that the functions have similar values or gradients or Hessians at any point in the domain. And you can actually give lower bounds very similar to the ones I showed earlier, but where you now have these delta factors
18:43
making the lower bounds weaker. And the question is, can we get upper bounds? Can we get algorithms which utilize a delta-related setting and require less communication in a way which depends on this delta? Yes. Is it surprising that the lower bound does not depend on m? Is it obvious, or is it just like
19:04
fact of life? Well, I talked just about the m equals 2 scenario. If you talk about more machines, then things become a bit different. But the matching upper and lower bounds may still depend on m? Yes, that could be. These
19:23
bounds do not depend on m, but yeah. In general, at least in this talk, I think of m, the number of machines as being generally a constant, but I agree that it's a good question to understand what happens when it's not. So again, so for the delta-related setting,
19:41
can we get algorithms for this? Maybe a different way to think about this question. So when we have a lot of data that we distribute between the machines, and if we have concentration of measure effects, it means that the delta actually becomes smaller and smaller. So in some sense, this is a situation where maybe by having more data,
20:02
we can reduce the amount of communication that our algorithm needs, because delta would become smaller. So I'm going to talk about one algorithm, which does have this kind of nice dependence on the delta. There have been follow ups to it since that I'll briefly
20:22
mention at the end. So the algorithm I'm going to talk about is called the DANE, short for distributed approximate Newton type method. For those of you who know the ADMM algorithm, the structure is very similar. So it's an iterative algorithm where each time each machine solves some local optimization problem of the following form. And then the
20:45
machines communicate to each other average gradients and average solutions. So after solving the local optimization problem, they compute the average. So this is the entire algorithm. What is the intuition here? So the crucial property is that
21:06
this algorithm is essentially equivalent to doing an approximate Newton step. So what is a Newton step? So if the problem is sufficiently smooth, and we have Hessians,
21:21
then one of the classical ways to do a optimization is to iteratively do something like the following. This has a very fast convergence, quadratic in general, but we can't implement it here because it requires us to compute and invert Hessians.
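The iteration referred to here is presumably the standard Newton step,

\[
w_{t+1} \;=\; w_t \;-\; \big(\nabla^2 F(w_t)\big)^{-1} \nabla F(w_t).
\]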
21:41
And as we said, actually, computing and communicating Hessians is pretty expensive. Now, in our setting, an equivalent way to write the Hessian is as the average of the local Hessians. And it turns out that, at least for quadratic functions, DANE is equivalent
22:02
to doing steps which are not of this form, not like a Newton step, but of this form. So ignore the mu i for now. Here we have the inverse of the average of the Hessians, whereas here we have the average of the inverses of the local Hessians. I should emphasize
22:23
that the algorithm doesn't explicitly compute these Hessians. So even if the dimension is huge, you don't need to store d by d matrices. Rather, implicitly, by solving these local optimization problems and updating, this is essentially what you do. Okay, so I solve this local problem.
22:53
This can be complex to do. Well, it depends. I do have regularization here, so I could do things like SAG or SVRG or SDCA.
23:04
You could also solve a Newton step by conjugate gradient with a fast algorithm as well. Yes, but to do conjugate gradient, the number of iterations you need to do would either scale with the dimension or with square root of the condition number. So you can do that, but the number of communication rounds would scale with the condition number.
23:23
So you don't get this improvement in terms of the number of communication rounds. Okay, so this thing and this thing are not the same because we invert the order of the inverse and the average. But the point is that we're talking
23:42
about a situation where these Hessians are similar. We're in the delta-related setting. If they were exactly the same, there would have been no difference between this term and this term. In the delta-related setting, they are different, but just a little bit. And the difference is quantified by this delta, which allows us to give a convergence guarantee,
24:07
which is basically the following theorem. So the idea is that at every iteration, we shrink the distance to the optimum by something which depends on H tilde inverse and H.
24:20
So H is the actual Hessian, and H tilde inverse is the average of the inverse Hessians. Again, if all of the local Hessians are exactly the same, H is just the inverse of H tilde inverse, and their product would be just the identity. And then setting eta to be one, I actually get convergence with just one iteration.
24:48
In the more realistic setting, where they're only delta-related, this thing won't be exactly zero but rather something that depends on delta and lambda, the strong convexity parameter.
25:03
And overall, you get an algorithm where the number of communication rounds is logarithmic in the required accuracy epsilon and depends on the square of delta over lambda. So if delta, for instance, is as small as lambda, it means that the number of communication rounds is just logarithmic in the accuracy and independent of everything else.
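For reference, the local problem and the averaging step described above have roughly the following form (quoted from memory of the DANE paper, so the exact placement of the parameters eta and mu may differ): at iteration t, machine i solves

\[
w_i^{(t)} = \arg\min_{w} \Big[ F_i(w) - \big(\nabla F_i(w^{(t-1)}) - \eta\, \nabla F(w^{(t-1)})\big)^{\top} w + \tfrac{\mu}{2}\,\big\|w - w^{(t-1)}\big\|^2 \Big],
\]

and the machines then average, $w^{(t)} = \frac{1}{m}\sum_{i} w_i^{(t)}$. For quadratics with local Hessians $H_i$, this amounts to the approximate Newton step

\[
w^{(t)} = w^{(t-1)} - \eta \Big( \tfrac{1}{m}\textstyle\sum_{i} (H_i + \mu I)^{-1} \Big) \nabla F(w^{(t-1)}),
\]

i.e. the "average of the inverses" contrasted above with the exact Newton step's inverse of the average.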
25:26
Just to give you an illustration of this algorithm, so this is on synthetic data, although we also did some experiments on real-world data. So here we compare DANE to maybe one of the most popular algorithms for this problem,
25:40
namely ADMM. The left column is for four machines, the right column is for 16 machines, and the different lines correspond to different amounts of data, which were randomly partitioned between the machines. So you can clearly see that for the DANE algorithm, as each machine gets more and more data, the relatedness of the local functions becomes stronger, so delta shrinks.
26:08
And indeed, look at the number of communication rounds that you need: here the x-axis is the number of communication rounds, and this is the log of the optimization error. So the number of communication rounds you need decreases.
26:21
In contrast, ADMM doesn't utilize relatedness between the local functions. So even if each machine gets more and more data, the convergence rate remains the same. Now, the guarantees I talked about are just for quadratic functions.
26:40
We can also provide some guarantees for non-quadratic functions, but they are a bit weaker. I won't discuss them. And as I said earlier, there have been some follow-ups to this work since. For instance, Yuchen Zhang and Lin Xiao had this very nice paper last year, where they proposed a different and somewhat more sophisticated algorithm called DiSCO
27:02
for the same setting as ours, where they improved the dependence on the ratio between delta and lambda. So we had delta over lambda squared. They have just square root of delta over lambda. So these are very nice algorithms.
27:21
One thing that should be kept in mind about them is that they're still not necessarily very, very cheap in terms of runtime, because we still need to solve some local optimization problem at every round, which in practice is a little bit expensive. So in terms of the amount of communication, it's generally small.
27:43
But in terms of runtime, it might not be the best possible. Another thing I should point out is that currently the kind of analysis that we have is either for quadratic functions, for the algorithm of Zhang and Xiao,
28:05
they were able to extend it to self-concordant losses under some assumptions. Also, the guarantees are slightly weaker. We still don't have an algorithm for the delta-related setting that works for general strongly convex and smooth losses.
28:23
At least one where you get a good dependence on delta. Yes? Between this method and the one you just presented, there's like a power of four, which is different. So you have a square and there you have the square root. Yes. What's the gain of, for example, what you're presenting over this one, if there's any?
28:42
Well, in practice, the difference between them is not necessarily that dramatic. Well, I'm talking about this algorithm because that's the algorithm I worked on. It gives a simple way to take advantage of the delta.
29:01
But in many cases, that algorithm could work better. But I prefer to talk about my own work rather than someone else's work. Do you think you could get away with just smoothness and not self-concordance? Yeah, it's a good question. So the difficulty is that both in our algorithm and their
29:25
algorithm, it's something very similar to quasi-Newton methods. And to do the analysis correctly, for any kind of algorithm from the Newton family, it's very difficult to get
29:40
something satisfactory without assuming something like self-concordance. In practice, I don't think it's really necessary in terms of practical performance. Both these algorithms, you can easily run on anything. But the analysis, I don't know how to do. Are there, like, worst-case scenarios where Newton does not converge on a smooth function,
30:00
or do we have no clue? If it's not smooth, you might not even have a Hessian. But for the case of a smooth function, but not more, like I said, can you show it does not converge in some situations? That's a good question. I don't know, actually.
30:20
It does work in practice. In practice, it does work. But I'm not sure if we know how to analyze it. Even for the standard Newton algorithm, we don't know how to analyze it in this kind of distributed setting, and this is only more difficult. Moving to the last scenario I'll discuss, maybe the easiest one in some sense, is when the data
30:44
is, in fact, randomly partitioned between the machines. So again, this is a special case of the delta-related setting, where delta is something like 1 over square root number of data points per machine. But can we utilize the fact that it's a random partition to get even better results
31:04
than in the delta-related setting? So just to translate the results of the previous algorithm, if we try to understand what is the regime where we can get a small number of communication rounds, something which is just logarithmic in the required accuracy, then the previous
31:24
algorithms give you that as long as the strong convexity parameter is at least 1 over square root of n, which is OK in many cases, but not always because, again, this lambda, the strong convexity parameter, often comes from some explicit regularization
31:42
that we add, and often it decays with the number of data points. Usually, in the literature, what you see is something between 1 over square root the number of data points and 1 over the number of data points. So this gives us one extreme of this regime, but many times we do want to use smaller
32:01
regularization, and then the number of communication rounds is not as good. In contrast, in the random partition scenario, I'm going to discuss actually a much simpler approach also in terms of the algorithm that does allow you a log of 1 over epsilon communication rounds as long as lambda is 1 over n, at least up to log factors.
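To summarize the two regimes being contrasted, ignoring constants and log factors, and with n as used in the sentence above:

\[
\text{delta-related algorithms (DANE, DiSCO):}\;\; \lambda \gtrsim \tfrac{1}{\sqrt{n}} \;\Rightarrow\; O(\log(1/\varepsilon)) \text{ rounds};
\qquad
\text{the approach below:}\;\; \lambda \gtrsim \tfrac{1}{n} \;\Rightarrow\; O(\log(1/\varepsilon)) \text{ rounds}.
\]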
32:24
So we get this nice behavior in terms of communication for a much broader choice of the regularization parameter lambda. But to explain this approach, I will need to take a detour and talk a bit about without replacement sampling for stochastic gradient methods,
32:42
after which we'll return back to the distributed setting. So we forget about the distributed setting for now. We just want to look at the situation where we have some function, which is the average of many individual functions, and we want to optimize it. So a very, very popular family of algorithms to do this is stochastic gradient methods.
33:06
So if we do something like stochastic gradient descent or the sub-gradient method, what we basically do is that each time we sample one of these functions fi, we compute the gradient or sub-gradient at the current point. We take a step along that direction and project back to the domain if needed.
33:28
And now, the way that the standard analysis works is by assuming that these i_t's, these indices, are sampled uniformly at random and independently from 1 to n.
33:41
And that works because then each g_t is an unbiased estimate of the gradient of the function I actually want to optimize, F of w. But there is actually a certain theory-practice discrepancy here, because in practice, quite often, it's much better to do sampling, not with replacement, but without replacement.
34:03
So if I already sampled some index, I don't sample it again. A different way to think about it is that I pick some permutation over the indices, uniformly at random, and then just go over the data according to that order.
34:20
So I just do a random shuffle of the data and go over it. And maybe I can then reshuffle the data and go over it again. And not only does it often work better in practice, with faster convergence, it's also often much easier and faster to implement.
34:43
Because going over data in sequential order because of caching effects, or if the data resides on some slow device, it's much faster to go sequentially than through random access. Now, intuitively, this without replacement sampling works better
35:03
because I am sort of forcing the algorithm to process all the data equally. If I sample with replacement, it only happens on average. But it turned out to be very difficult to analyze these stochastic gradient methods when we do sampling in this way, because now the updates are correlated.
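To fix the two sampling schemes being compared, here is a minimal sketch (my own illustration, not code from the talk; grads is a placeholder list of per-example gradient functions, and the projection step is omitted):

import numpy as np

def sgd(grads, w0, step_size, num_steps, with_replacement=True, rng=None):
    # grads: list of n callables, grads[i](w) = gradient of the i-th loss f_i at w.
    rng = np.random.default_rng() if rng is None else rng
    n = len(grads)
    w = w0.copy()
    if with_replacement:
        # Standard analysis: indices drawn i.i.d. uniformly from {0, ..., n-1}.
        indices = rng.integers(0, n, size=num_steps)
    else:
        # Without replacement: one random shuffle of the data, processed in order
        # (reshuffle and repeat if more than one pass is needed).
        indices = rng.permutation(n)[:num_steps]
    for i in indices:
        w = w - step_size * grads[i](w)
    return w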
35:23
In other words, I no longer pick the indices independently of everything that came before. There has been a little bit of work in this direction. So there are classical results for incremental gradient methods: basically, convergence bounds which work no matter in which order I go over the data.
35:43
But I'm not assuming any kind of randomness here. The bounds are much weaker. You can actually show that, at least in some cases, they can be exponentially slower than with-replacement sampling. Very recently, there was a very interesting work by Gürbüzbalaban, Ozdaglar, and Parrilo,
36:03
which tried to analyze stochastic gradient descent for strongly convex and smooth problems, and showed that if you do sufficiently many passes over the data, doing sampling without replacement, then eventually you do get a small error. So as k gets larger, you get a decrease here,
36:23
which is like one over k squared, versus one over k in the with-replacement setting. However, they also have a very strong dependence on the number of data points. So just to make the bounds here non-trivial, you need to do at least n passes over the data. If you want to be better than with replacement,
36:42
k has to be something like n cubed. And this is a little bit unsatisfactory because the case where we want to use stochastic gradient methods to begin with is when we don't want to do many passes over the data. If you're willing to do many passes over the data, there are actually much better methods. So just do plain gradient descent or accelerated gradient descent
37:04
or fast stochastic methods. So in a situation where k is small or maybe even one, these results don't tell us much, unfortunately. So what I'm going to talk about next is some new results,
37:21
which give an analysis for stochastic gradient methods without replacement sampling. Our goal is a little bit more modest in the sense that we won't show that without replacement is strictly better, but at least we show that it's not worse in a worst case sense. So again, considering scenarios like a strongly convex and smooth functions
37:46
or convex functions, we get the same kind of convergence rates as in the with-replacement sampling case. So we talk about various scenarios,
38:01
either convex or lambda strongly convex and smooth, and also an analysis for a without-replacement version of the SVRG algorithm, which is the one that will relate later on to distributed learning. So I'll explain a little bit how this kind of analysis works. It basically uses ideas from stochastic optimization, but also from adversarial online learning and transductive learning theory.
38:26
So if you're not familiar with these, don't worry. I'll explain as we go along. So the simplest maybe to explain is the situation where the functions are just convex and Lipschitz, each fi,
38:42
and we look at an algorithm which sequentially processes these functions according to some random permutation producing iterates. And our goal is to prove that in expectation, the average suboptimality of the iterates is order of one over square root of t.
39:08
And so based on this, you can argue that if you pick a single w_t with t chosen uniformly at random, in expectation this would be the bound. Or you can take the average of the w's. Basically, this is the kind of convergence bound we would like.
39:26
And actually, I'm not going to talk about a particular algorithm. All I would require is that the algorithm will have a regret bound in the adversarial online learning setting, which is a situation where these functions,
39:41
I basically don't assume any kind of statistical assumptions about them. They're arbitrary functions, but maybe convex and Lipschitz. And then I want that the average loss of the points wt that the algorithm produces is only one over square root of t worse
40:02
than the average loss of any fixed point w in my domain. So for instance, this holds for stochastic gradient descent, but also other algorithms. And the proof idea I can basically show in two slides.
40:21
I'm not sure I'll have time to go over every point, but this is the following. This is the thing that we want to bound. I'm using the fact that this sigma, this permutation is chosen uniformly at random. So in expectation, the marginal of f sigma t is just a big F.
40:45
I add and subtract terms. And then I apply the regret bound that I assume. So this allows me to upper bound this whole thing by one over square root of t. The second term I write a little bit differently and do some simple algebraic manipulations.
41:07
And what I end up with is this bound, where what I have here is the expected difference between the average loss of w_t on the losses seen so far minus the average loss
41:22
on the losses that still weren't seen according to the permutation. And now I'm going to do something which might appear to be very loose. So I upper bound the expectation term by the expectation of the supremum over every possible point w in my domain.
41:43
So why do I do this? I do this because what basically I'm asking when I'm trying to bound this expression is, I had my data. I randomly partitioned it, fixed some t. I randomly partitioned it to a group of size t minus 1 and a group of the rest.
42:06
And I ask for any given point w how large can be the difference in the average loss between these two things. So because of concentration of measure and uniform convergence effect, this thing can generally be small.
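Written out, the quantity being bounded at this point is, schematically (with sigma the random permutation and W the domain; I may be off on the exact index ranges):

\[
\mathbb{E}\;\sup_{w \in \mathcal{W}} \left[ \frac{1}{t-1}\sum_{s=1}^{t-1} f_{\sigma(s)}(w) \;-\; \frac{1}{n-t+1}\sum_{s=t}^{n} f_{\sigma(s)}(w) \right],
\]

i.e. the gap, uniformly over the domain, between the average loss on the examples already processed and the average loss on those not yet seen.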
42:22
And actually, this is exactly the term that has been studied in transductive learning theory. So in transductive learning, we have some fixed data set, which is split to a training set and a test set. And then you can ask what is the difference between the empirical risk of the training set versus the risk or the average loss on the test set.
42:45
And there's been an entire theory developed exactly for these things. In particular, it can be shown that you can upper bound this expectation of the supremum by a transductive version of Rademacher complexity, which is used to...
43:00
What are n and t here? So n is the total number of data points. And t is the number of iterations the algorithm does. So you sample a single permutation and you do t steps? Yes. So here I'm assuming that t is less than or equal to n.
43:25
I do just one random shuffle of the data. Actually, everything I do here can be generalized to a situation where you do repeated reshufflings. But at the end t will not be n. Yeah, so actually, this term will dominate this term. But that's okay, because I just want to end up with a bound which is like 1 over square root of t.
43:44
Do you need t to be less than n, or in the end... No, it could actually equal n, that's fine. So you could do a full pass? Yes. Okay, so you would want to be more general? So you're allowing for less than a full pass?
44:01
I mean, the analysis allows me to do that. Okay, so this kind of expectation of a supremum, you can upper bound by a transductive Rademacher complexity of the domain W, and then you can instantiate it for various cases.
44:22
Maybe the simplest one is if we talk about linear predictors with bounded norm and losses which are convex and Lipschitz; then we get the 1 over square root of t rate. And actually, you can also show that all the parameters hidden in the O notation, the norm of the predictors and the Lipschitz constant, are all correct.
44:43
You get the same thing as in the with-replacement case, up to maybe constants. And you can also do a more sophisticated version of this analysis to get, say, 1 over lambda t with lambda strong convexity.
45:01
Here, you do need to work harder because uniform convergence doesn't give you 1 over t rates, but it turns out you can do some trick to get around it. I won't have time to go into the details. But to start getting back to the distributed setting, I want to focus on the results for the SVRG algorithm,
45:24
which belongs to a family of algorithms, most of them developed over the past few years, including by members in the audience here, which are exactly targeted at solving optimization problems of the following form.
45:40
They have cheap stochastic iterations like stochastic gradient descent, but their convergence rate is linear: to get epsilon accuracy, the number of iterations only scales logarithmically with epsilon. And all the analyses I'm familiar with rely strongly on with-replacement sampling.
46:00
And we instead consider without-replacement sampling, and we picked in particular the SVRG algorithm. So the standard with-replacement version of the algorithm has the following form. It's a very simple algorithm. It goes in epochs. In each epoch, we compute one full gradient,
46:21
so the gradient with respect to the function we actually want to optimize. And then we do t stochastic iterations, where each time we pick one individual loss uniformly at random and do an update which has this form. In expectation it still corresponds to the gradient of F,
46:41
but these terms here ensure that the variance, the noise that we have in this update gets smaller and smaller over time. And the standard analysis is that basically you need to do log of one over epsilon epochs overall. And in each epoch, the number of stochastic iterations needs to be at least one over lambda.
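A minimal sketch of the with-replacement SVRG epoch structure just described (my own illustration; the helper names, step size, and epoch length are placeholders):

import numpy as np

def svrg(grads, full_grad, w0, step_size, num_epochs, inner_steps, rng=None):
    # grads[i](w): gradient of the i-th individual loss f_i at w.
    # full_grad(w): gradient of the full objective F at w (one pass over all data;
    # in the distributed setting this is where a communication round happens).
    rng = np.random.default_rng() if rng is None else rng
    n = len(grads)
    w = w0.copy()
    for _ in range(num_epochs):
        w_ref = w.copy()
        g_ref = full_grad(w_ref)      # full gradient at the reference point
        for _ in range(inner_steps):
            i = rng.integers(0, n)    # with-replacement sampling
            # Variance-reduced stochastic gradient: unbiased, with shrinking noise.
            g = grads[i](w) - grads[i](w_ref) + g_ref
            w = w - step_size * g
    return w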
47:07
Now, in a recent paper, Jason Lee, Lin, and Ma made this very nice observation that actually you can take this algorithm and apply it basically as is for distributed learning.
47:22
So in distributed learning, the individual functions fi are distributed across different machines, but still the machines can simulate this algorithm. So each time they do a communication round to compute the full gradient. And then each machine runs these cheap iterations on a subset of their data.
47:45
Now, there is a difficulty here because the algorithm requires with-replacement sampling. And we're talking about a situation where the data was partitioned at random. So it doesn't correspond to sampling with replacement,
48:01
but at least as long as the number of iterations is one over... sorry, this should be square root of n... as long as you sample less than the square root of the total number of examples, then by the birthday paradox, with and without replacement sampling are more or less the same. So this thing would work.
48:21
So when I take this constraint and plug it into the analysis, it means that we do get an algorithm for distributed learning and optimization this way. However, it's only applicable when the strong convexity parameter is at least one over square root of n, which as we said earlier is quite a bit restrictive.
48:43
So what instead you can do is simply do the same thing in the same algorithm, but this time use without replacement sampling, which fits much more with the random partition data that we deal with. So it's exactly the same as in the previous slide, but instead of each time picking individual loss independently,
49:06
we fix some permutation over the data and do the update according to this permutation. And now this is something that I can actually simulate with randomly partitioned data, all the way up to t being of order n, the order of the number of data points.
49:25
Again, maybe up to log factors. You do less than a single pass over the data? Yes. The point here is that I have these expensive gradient calculations, which require a full pass on the data.
49:40
But the number of stochastic iterations I need to do is actually less than the size of my data. So that's- We have tried to do like a fixed random permutation and do several passes. It does diverge quickly, but the step size has to be much smaller. Yes. So I'm going to talk about the analysis in a moment,
50:02
but it's very important that if you do several passes over the data, you reshuffle the data each time. Otherwise, the analysis doesn't work. But I think that this was also- It was noticed that with these algorithms, if you don't re-permute each time,
50:21
it can either converge poorly or not at all. Is the basic idea of the algorithm clear? So this is an algorithm you can apply on anything, but what can we say in terms of rigorous guarantees? So currently, we can give a bound for without replacement SVRG,
50:43
but only for regularized least squares. This is for technical reasons, as far as I can discern, but this is what we can currently do. Still, it's an important setting. So again, using the same kind of parameter choices, log of one over epsilon epochs and one over lambda
51:01
without-replacement stochastic iterations, we get a similar kind of convergence rate as in the with-replacement case. So what this means in the context of distributed optimization is that at least for regularized least squares, we can find an optimal solution.
51:24
Actually, two implications here. So for non-distributed optimization, it means that if we just want to do without replacement version of SVRG, we actually don't need to do any data reshuffling all the way up to lambda being around one over the data size,
51:42
which again is good in situations where access to the data is expensive and doing this reshuffling is not something you want to do too many times. In the context of distributed optimization, again, where lambda is at least one over the data size, you get an epsilon optimal solution with randomly partitioned data, and you only need to do a logarithmic number of communication rounds.
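Putting the pieces together, here is a rough sketch of the distributed simulation described above, under the stated assumptions (randomly partitioned data; the stochastic phase touches each local data point at most once; for simplicity I let one machine per epoch run the inner loop, which may differ from the exact scheduling in the papers discussed):

import numpy as np

def distributed_svrg_no_replacement(machines, w0, step_size, num_epochs, inner_steps, rng=None):
    # machines: list of (local_grads, local_full_grad) pairs, where local_grads[j](w)
    # is the gradient of one locally stored loss at w, and local_full_grad(w) is the
    # gradient of that machine's average loss F_i at w.
    rng = np.random.default_rng() if rng is None else rng
    w = w0.copy()
    # Each machine consumes its local points in a random order and never revisits
    # them in the stochastic phase (they are still used in the full-gradient passes).
    # Assumes there is enough local data so the streams are never exhausted.
    streams = [iter(rng.permutation(len(local_grads))) for local_grads, _ in machines]
    for epoch in range(num_epochs):
        w_ref = w.copy()
        # Communication round: average the machines' local full gradients.
        g_ref = np.mean([full(w_ref) for _, full in machines], axis=0)
        # Cheap local iterations, run here by one machine per epoch in turn.
        k = epoch % len(machines)
        local_grads, _ = machines[k]
        for _ in range(inner_steps):
            j = next(streams[k])      # each local point is used at most once
            g = local_grads[j](w) - local_grads[j](w_ref) + g_ref
            w = w - step_size * g
    return w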
52:03
And because of the structure of the algorithm, the runtime is actually dominated by this full gradient computation, which is fully parallelizable because each machine can just compute its own gradient on its local data and only do an averaging at the end.
52:22
So also in terms of runtime, you get runtime speed up by using more machines. So in your first application, so you shuffled there once, or it's just randomly, and then you never changed the permutation? Yeah, you just need to do it once.
52:42
You do need to pass over the data several times, but you don't need to change the order. It's sequential passes. Because that seems to contradict the... I guess probably the log of one over epsilon, the constant in front of it, has an n times worse term, probably.
53:03
That's the slowdown that Francis mentioned. Like you need to do much, much smaller step size. No, so actually, again, the point is that here, when I say you don't do data reshuffling, it's because the number of stochastic iterations you do is not larger than the data size.
53:21
You do more than one pass because you need in each epoch to compute the full gradient. But to do the stochastic iterations, you don't touch a data point more than once. And that is important. Otherwise, I don't know how to get this kind of... Between two epochs, do you need to change your permutation? No.
53:40
So in this case, my guess, to reconcile with the fact that, empirically, when you don't change the permutation, you need to use much smaller step size? Empirically, we are using like SAG or whatever, or SAGA, which does not recompute the full gradient. So you recompute the full gradient. At a given point, you start from scratch.
54:01
And also, that's a good point. This is also an analysis for SVRG. I don't have an analysis at the moment for SAG or SDCA. And the structure of these algorithms is also different. So here, I'm really utilizing the fact that this algorithm uses full gradient computations and a small number of stochastic iterations.
54:24
SAG or SDCA need to do... They don't have a full exact gradient computation. Instead, they have more stochastic iterations. So that does break the analysis here, because it does require you, in the stochastic phase,
54:41
to touch a data point more than once. Yes? Right. So if you don't reshuffle between the epochs, that means that there are some data points which you never see in your stochastic iterations. But is that a... Yes, that is true. You see them in the full gradient computations.
55:02
Any other questions? I think this is more or less the end. So to summarize, I talked about three scenarios in distributed learning and optimization,
55:21
and gave all kinds of results in each of them. So maybe the one that we currently, I think, understand best, at least in terms of worst case guarantees, is the arbitrary partition case, where maybe a bit disappointingly, the simplest baseline is also worst case optimal. In the delta-related settings, the situation
55:41
is that we can handle quadratic functions pretty well, maybe self-concordant losses, but we currently don't have algorithms, at least with provable guarantees, for generic strongly convex and smooth functions. And also the kind of algorithms that we have are a bit heavy. They need to actually solve an optimization problem
56:02
at every iteration. And finally, in the random partition, currently we have provable guarantees for least squares. The algorithm I presented based on the SVRG, I think, we still haven't experimented on it too much, but I suspect it would work generally
56:21
on strongly convex and smooth functions, but we don't have an analysis for it at the moment. Actually, it is possible to give some kind of analysis, but again, it's only a situation where lambda is very large, so large that there's actually not a difference between with and without replacement sampling.
56:44
So it's not a very interesting result. So I think I'll stop here. Thank you very much.