Tradeoffs in Distributed Learning
4 views
Formal Metadata
Title 
Tradeoffs in Distributed Learning

Title of Series  
Part Number 
1

Number of Parts 
10

Author 

License 
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 
Institut des Hautes Études Scientifiques (IHÉS)

Release Date 
2016

Language 
English

Content Metadata
Subject Area  
Abstract 
In many largescale applications, learning must be done on training data which is distributed across multiple machines. This presents an important challenge, with multiple tradeoffs between optimization accuracy, statistical performance, communication cost, and computational complexity. In this talk I'll describe some recent and upcoming results about distributed convex learning and optimization, including algorithms as well as fundamental performance barriers.

00:00
Flow separation
Process (computing)
Constraint (mathematics)
Distribution (mathematics)
Sheaf (mathematics)
Mereology
Mathematical optimization
Subtraction
Hand fan
Condition number
02:06
Point (geometry)
Maxima and minima
Multiplication sign
Correspondence (mathematics)
Insertion loss
Open set
Order of magnitude
Number
Roundness (object)
Natural number
Term (mathematics)
Average
Subtraction
Mathematical optimization
Area
Standard deviation
Focus (optics)
Process (computing)
Gradient
Term (mathematics)
Functional (mathematics)
Vector space
Convex set
Mathematical optimization
Local ring
Resultant
05:54
Theory of relativity
Distribution (mathematics)
Insertion loss
Mathematical model
Term (mathematics)
Functional (mathematics)
Perspective (visual)
Open set
Expected value
Partition (number theory)
Order (biology)
Subtraction
Resultant
06:40
Point (geometry)
Partition (number theory)
Maxima and minima
Computer animation
Concentric
Gradient
Gradient
Similarity (geometry)
Functional (mathematics)
Perspective (visual)
Measurement
Maxima and minima
08:22
Area
State of matter
Gradient
Chemical equation
Firstorder logic
Gradient
Physical law
Functional (mathematics)
Number
Maxima and minima
Partition (number theory)
Roundness (object)
Computer animation
Term (mathematics)
Average
Iteration
Subtraction
Mathematical optimization
Resultant
Gradient descent
10:13
Gradient
Multiplication sign
Gradient
Regular graph
Functional (mathematics)
Number
Glatte Funktion
Partition (number theory)
Order (biology)
Roundness (object)
Computer animation
Root
Iteration
Convex set
Square number
Mathematical optimization
Local ring
Condition number
Gradient descent
11:50
Partition (number theory)
Order (biology)
Roundness (object)
Computer animation
Iteration
Term (mathematics)
Gradient
Family
Functional (mathematics)
Number
Maxima and minima
12:42
Point (geometry)
Slide rule
Divisor
Weight
Number
Mach's principle
Roundness (object)
Root
Manysorted logic
Iteration
Term (mathematics)
Average
Square number
Game theory
Units of measurement
Covering space
Standard deviation
Quadratic function
Block (periodic table)
Optimization problem
Forcing (mathematics)
Gradient
Exponentiation
Algebraic structure
Functional (mathematics)
Bilinear form
Partition (number theory)
Computer animation
Vector space
Combinatory logic
Universe (mathematics)
Order (biology)
Iteration
Diagonal
Mathematical optimization
Mathematical optimization
Matching (graph theory)
Local ring
17:19
Area
Gradient
Bound state
Algebraic structure
Quadratic form
Functional (mathematics)
Mach's principle
Glatte Funktion
Partition (number theory)
Computer animation
Iteration
Square number
Absolute value
Resultant
18:06
Point (geometry)
Group action
Divisor
Gradient
Logarithm
Similarity (geometry)
Insertion loss
Discrete element method
Number
Duality (mathematics)
Iteration
Subtraction
Theory of relativity
Concentric
Physical law
Gradient
Bound state
Incidence algebra
Set (mathematics)
Functional (mathematics)
Time domain
Measurement
Power (physics)
Maxima and minima
Computer animation
20:11
Group action
Proper map
Multiplication sign
Optimization problem
Gradient
Algebraic structure
Parameter (computer programming)
Average
Bilinear form
Mortality rate
Approximation
Maxima and minima
Category of being
Computer animation
Mathematical optimization
Local ring
21:48
Proper map
Computer animation
Average
Hausdorff dimension
Optimization problem
Inverse element
Functional (mathematics)
Bilinear form
Local ring
22:49
Scaling (geometry)
Proper map
Parameter (computer programming)
Average
Regular graph
Mortality rate
Number
Maxima and minima
Roundness (object)
Computer animation
Root
Conjugate gradient method
Term (mathematics)
Hausdorff dimension
Order (biology)
Square number
Iteration
Subtraction
Condition number
24:06
Parameter (computer programming)
Distance
Number
Roundness (object)
Iteration
Average
Square number
Contrast (vision)
Subtraction
Product (category theory)
Theory of relativity
Rate of convergence
Independence (probability theory)
Bending
Set (mathematics)
Line (geometry)
Functional (mathematics)
Maxima and minima
Computer animation
Convex set
Theorem
Iteration
Right angle
Square number
Local ring
26:35
Axiom of choice
Standard deviation
Gradient
Decision theory
Direction (geometry)
Multiplication sign
Water vapor
Insertion loss
Parameter (computer programming)
Permutation
Maxima and minima
Roundness (object)
Insertion loss
Meeting/Interview
Square number
Sampling (music)
Area
Optimization problem
Gradient
Sampling (statistics)
Uniform convergence
Ext functor
Price index
Term (mathematics)
Functional (mathematics)
Maxima and minima
Concordance (publishing)
Partition (number theory)
Category of being
Sample (statistics)
Lattice (order)
Quadrilateral
Order (biology)
Uniform space
Resultant
Point (geometry)
Divisor
Mathematical analysis
Student's ttest
Regular graph
Theory
Number
Goodness of fit
Field extension
Root
Iteration
Term (mathematics)
Average
Contrast (vision)
Subtraction
Standard deviation
Quadratic function
Projective plane
Mathematical analysis
Weight
Set (mathematics)
Local Group
Estimator
Computer animation
Convex set
Discrepancy theory
Family
Local ring
34:24
Classical physics
Point (geometry)
Standard error
Randomization
Group action
Direction (geometry)
Mathematical singularity
Insertion loss
Mathematical analysis
Weight
Hand fan
Theory
Number
Permutation
Expected value
Stochastic
Sign (mathematics)
Root
Average
Boundary value problem
Sampling (music)
Linear map
Convex function
Point (geometry)
Gradient
Rate of convergence
Mathematical analysis
Thermal expansion
Functional (mathematics)
Sequence
Permutation
Connected space
Maxima and minima
Derivation (linguistics)
Category of being
Computer animation
Convex set
Order (biology)
Uniform space
Mathematical optimization
Square number
Resultant
Sinc function
Gradient descent
40:15
Slide rule
Statistical hypothesis testing
Computer programming
Group action
Product (category theory)
Concentric
Multiplication sign
Expression
Bound state
Insertion loss
Uniform convergence
Set (mathematics)
Measurement
Theory
Local Group
Wave packet
Derivation (linguistics)
Expected value
Frequency
Computer animation
Term (mathematics)
Average
Subtraction
Linear map
42:48
Logical constant
Complex (psychology)
Group action
Multiplication sign
Direction (geometry)
Insertion loss
Parameter (computer programming)
Mereology
Grothendieck topology
Permutation
Subset
Expected value
Stochastic
Roundness (object)
Square number
Cuboid
Optimization problem
Moment (mathematics)
Gradient
Rate of convergence
Sampling (statistics)
Uniform convergence
Bilinear form
Functional (mathematics)
Permutation
Derivation (linguistics)
Maxima and minima
Arithmetic mean
Order (biology)
Linearization
Mathematical optimization
Resultant
Point (geometry)
Slide rule
Divisor
Line (geometry)
Mathematical analysis
Average
Mortality rate
Hand fan
Number
Root
Iteration
Term (mathematics)
Selectivity (electronic)
Linear map
Standard deviation
Mathematical analysis
Bound state
Paradox
Variance
Total S.A.
Mortality rate
Calculation
Computer animation
Convex set
Noise
Iteration
Object (grammar)
Family
Local ring
Limit of a function
Gradient descent
50:38
Axiom of choice
Group action
Gradient
Multiplication sign
Modulform
Similarity (geometry)
Least squares
Parameter (computer programming)
Regular graph
Number
Mach's principle
Frequency
Roundness (object)
Iteration
Square number
Cuboid
Gamma function
Logarithm
Optimization problem
Rate of convergence
Algebraic structure
Set (mathematics)
Computer animation
Theorem
Iteration
Mathematical optimization
Mathematical optimization
Square number
52:09
Point (geometry)
Theory of relativity
Multiplication sign
Modal logic
Gradient
Moment (mathematics)
Mathematical analysis
Algebraic structure
Solid geometry
Discrete element method
Sequence
Permutation
Grothendieck topology
Number
Flow separation
Stochastic
Mathematics
Phase transition
Order (biology)
Theorem
Iteration
Gamma function
Social class
55:06
Quadratic function
Group action
Optimization problem
Moment (mathematics)
Mathematical analysis
Insertion loss
Set (mathematics)
Functional (mathematics)
Food energy
Maxima and minima
Concordance (publishing)
Partition (number theory)
Computer animation
Term (mathematics)
Square number
Theorem
Iteration
Mathematical optimization
Mathematical optimization
Subtraction
Square number
Resultant
56:53
Computer animation
00:04
the tell them around the world think session also added the word optimization because stock of being 1 focused on section for new optimization and so anyway thanks to the introduction and thanks for inviting me answer this stock is actually based on their several works some of them together with Western bunny as well as works with no discernible job so when we talk about Distributed Learning Nunavut musician we're talking about a situation where we want to do some standard turning worked musician task but where the data is it petitioned across several machines and this can be a coming various away so it can be your starting from having several quarters on the 2nd CPU through several militants on the In cluster and up to sound a huge computing grid with different machines may be a different parts of the world I am so distracted learning and optimization is something that has become very popular in recent years and there are 2 the main reasons why we we all want to consider the setting up so 1 of them is a bit when we have lots of data so as we all know we live in the era of big data and sometimes that is so large that we cannot fit it in a single machine we need to distribute data across several machines and on the learning based on that of this is to reviewing distributed distributed setting as a constraint ah but another situation where it's actually distributed as an opportunity is when you know we don't have 1 machine we have conditions so hopefully we'll be able to solve the problem Cape times faster they came
02:07
and there's some challenges when we want to do other things optimization and distributed it said but the first one is the issue of Communications soul no matter which of the scenarios were talking about whether it's in and same CPU geographically distributed computing with communication is always something which is much much slower than local area processing takes much more time to send data between machines compared to intuition doing something of locally and its on a date unusually releases them open orders of magnitude difference so generally we want algorithms which are distributed which communicate but actually communicate as little as possible because it's expensive the 2nd challenge is how do we paralyzed the computation and the challenge here is that many standard in learning and optimization algorithms are very inherent consequential in nature so if we look for instance on status to pass a sent this is based on their sequentially taking example doing some of the men another example that another update in Africa it's not clear how we take such kind of finale algorithm them and make it work in Paris the 3rd challenge is that want the result to be accurate so obviously we what don't want to suffer because we distributed the computations of the of the quality should resemble what we we could get undistributed elderly I'm so a and there are many ways to win it to win a formalized this setting sending that were the focus on is it wouldn't want to do something akin to empirical riskier minimization and when there is a function that we want to optimize its convex soul we basically want to minimize some function which we can write as an average of flights each of these F5 represent the average loss all made it 1 of the machines and want to minimize the average but an intuition in turn has entered a data points of simplicity resuming its mission and the same number of examples but everything else they can be easily generalized but in the end the functions are generally convex we can talk about a situation similar to none distributed to musician weakened and divided the scenarios where functions of media strongly convex were smooth or both I will discuss each of these scenarios later on and in terms of the communication would generally assume that our communication takes place in communication rounds where machines can broadcast to each other and the amount of communication corresponds to sending aura of debates opposition per would be that they mention so is at the machines consent to each other to each other maybe vectors or gradients but not the things like a Heston's for instance the biting waitresses and this corresponds to a word it no big deal .period highdimensional learning scenario where he can be very large and sending Cindy mighty nature says is not something out which is feasible
05:27
it and the main question is sending is how can we serve optimally tradeoff between these is a 3 requirements of immediate and accurate solution with as little communication as possible and with it as those around them as possible ideally getting a speedups by improvisations and I notice that in this stock up when we're talking about accuracy and going to focus on optimization and so our goal is to
05:54
minimize the need in this empirical risk function I could also talk about other models of for instance
06:02
you can assume that the data is sampled ID from some distribution in your goal is to minimize the expected loss or risk and and that the reserve put things in such a different perspective but there are many ways in many cities 1 can talk about here but this is the setting I will focus on a OK so is
06:27
trying to make things a bit more can create armaments in order to Scott discuss different results we need to make various assumptions on how the data was distributed to the machines in the 1st place what can we say about the relation between them if
06:42
someone scenario is when we don't assume anything so they know was there in petition in some arbitrary way maybe 1 she has a positive examples another machine has a negative examples on this
06:58
is 1 sitting up at the other extreme we may assume that the data was actually partitioned and try and so we had a bunch of data points that were just signed uniformly at random it the missions and then the situation and from the algorithm in designer perspective but is potentially improved because now there are stronger relationships between the data across machines of for instance we have fierce concentration of measure affects the values of the gradients of the local functions of the Commission machinery Leonard and as we'll see later this is something that we can utilize In another setting that is interesting to talk about which in some sense generalizes the previous 2 eyes Delta related setting where we assumed that in you are relationships between the values of the gradients off in the local functions at any point on again if the data is in petition there random them I really have this kind of situation where Delta is pretty small but you can also add discuss your more general thinks maybe there or statistical similarities between the 2 points but maybe it wasn't exactly petitioned the them so in some sense it lies between the arbitrary and random petitions in but so
08:23
basically what we're gonna do this stock is to discuss each of these areas 3 scenarios and was discussed both in the upper end in law balance mainly in terms of the amount of communication and also discuss in In the runtime In the regarding the right petition I so I think the results there are actually able to Reliance during your results which might give independent interest actually having to do with without replacement sampling suppressor gradient methods don't talk about that but also point out how it gives is something new for Distributed Learning with importation OK so let's
09:09
start with the arbitrary petitions scenario so we don't assume anything about say what's the relationship between the 2 are functions of different machines in the simplest baseline here than 1 can think of is just to reduce it to standard firstorder not distributed optimization so we can the fact that we are in In conservative scenario where just as it is an average of functions different machines but as we can just have each machine computer gradients of this state capital and this requires 1 communication around so each machine computes said the gradient of the local function in the new communication round to average I am and that gives us basically and cool to compute gradients offered begin finale can plug it into any kind of black boxes 1st firstorder algorithm for instance gradient descent I and then you get to know where the where the number of communication rounds is the same as the number of iterations of this algorithm to integrating the center can also do all
10:15
the other things that people do in the standard optimization and you can do accelerating gradient descent smoothing and so on and then gives you opera bells on a number of communication rounds and with the functions of local functions of strongly convicts and small and that the number of a communication round was skin like square root of the condition number in 1 over London functions from convicts since I cannot stock about adjusting smooth but numbers from a convex convex and so what you just drive it from the standard upper bounds for In these algorithms I am and they don't want to do this is a very nice and simple approach also almost fully paralyzed both because most of the time each machine just computes in the gradient of its own local function but it does require a relatively large number of communication rounds so when we do In an art Scalia learning problems in hiding mentions the Lundale usually comes from an explicit regularization that we add to the problem and it is for statistical learning considerations usually it's quite small and actually the case with the amount of data that you have to London is generally quite small and then you may need to do many communications rebels depending on In epsilon rental isn't desired accuracy no we thought Qwest
11:47
yes speaking to you Everything is fully
11:51
synchronized you hear from him because most of the residents of varying simple and simple baselines you can always do I know
12:01
of course as you might suspect there are probably more sophisticated things you can do when they're actually have been a lot of work in recent years in on algorithms for such a situation from medium and Coco Coco plus many others in but at least in in the worst case do they actually improve on this simple baseline and the answer may be surprisingly is no so again face in the worst case oversee all strongly comics and small functions you can't get something better in terms of the number of communication rounds at least for a very very large family of algorithms in
12:42
so basically means algorithms which fall into the following day at Temple is so each wishing implicitly has some kind of a said of vectors WG and what's the machine can do between communication rounds is our is sequentially computer factors which are basically in the linear combinations the lecturers computed so far or ingredients of the local function that point or even things like a multiplying the points with CNN's in and actually hit the port it's doesn't even have to be that the point it computes the is In the span of these things can be also a linear combination of on the point in its gradients of instance that allows us to also consider algorithms which sold some kind of local optimization problem are at each iteration of emissions can actually do this for as many as 4 as long as they want in terms of some of the more bound than going to show I am an enduring communication around the can basically share some of the matches they have completed OK so I don't know if it covers every possible imaginable the algorithm for this problem but it does in a covered kind of reasonable approaches the least I can think of the what is that going to the fully to yet this is just for technical reasons so the point is that if the government is negative it is that you you may be able to solve a local at every round local optimization problems which earned on convicts this is actually something that would break a lot but it's also in you know if you limit yourself to army algorithms which are based on comics optimization on then basically have a in infected which a positive view of the universe inside yes yeah Our and show the opera
14:55
it's actually very simple I'm going to focus on the case where we just have to machines are down and the local function mission will be just a quadratic function are only 1 of them will have little nearterm where 1 is the 1st standard unit vector and a 2 are in it to make forces which have the following form so sort of In the blonde they had lowered the blocks are in overlapping so what's the ignorance was considered the 1st machine before any communication was done us because of the bias terms it would be able to compute vector with a 0 in the 1st quarter net but then would be for communicating it won't be able to manufacture any lectures with fared in nonzero values except in the 1st quarter but once communication around the happens mission will be able to make the 2nd coordinated actually the 3rd quarter nonzero but again it would get stock it again because of the structure of this of visa Of these images is so basically the number of communication rounds limit how many coordinates are we can make a nonzero in In vectors of the machines computer but the optimal offered this problem but actually a and requires all coordinates to be a on 0 and if only the 1st few corn a nonzero that gives us a lower bound on June 18 on the Arabs after team durations here can be no less than exponential in mind over square root of 1 of London and that basically gives the lower bound for the strongly conducts in some cases as some of you might recognize if you look at how the global function looks like the average of the phone and have to do this is essentially the kind of heart function and that's it being used to prove lower bounds for In 1st order algorithms in non distributed victimization scenario I soon the In construction is in the same but here we actually make different structural assumptions so again in this and previous slides machines can
17:21
compute local has hands and multiplied with things that's fine and still logged on
17:26
hold area you can also do similar things to get results forcing nonschool functions but the basic idea is still the same but the construction is different so it without seeing this week are again create 2 functions with this kind of interlocking is structure but now it's not the quadratic form with absolute values are and get a and 1 land that Square in Lower Bound which is matched if you do this simple baseliner talked about a specifically with accelerating gradients sent with proximal smoothing Our any questions about
18:09
this OK I also
18:12
next will discuss the Delta related incident which as I said a situation where we assume again and I'm going to assume that the data was a necessary petition the surround them but still I will assume that the functions have similar values of gradients were has since on at any point in the domain N and then you can actually give law bounds very similar to the 1 shoulder earlier books where you now have these Delta factors which making the law bounds in weaker and the question is can we get up about can get algorithms which utilize a Delta related they're sending and require less communication in a way which depends on this said yes that is fighting the war of the them the head of the use of them in a well I talk just about the end equal to weigh in this scenario in the dark about more then things become a bit different visions of the festival further losses this year yes that could be these bounds they do not dependent on but yeah in general at least in this talk I think of and the number of machines has being generally a constant but today I agree that it's a good question to understand what happens when it's so did so for the Delta related setting can we get our for this but many different way to think about this question so when we have a lot of data that we distributed between the machines and a new fully have concentration of measure effects it means that the Delta actually becomes smaller and smaller so and some says this is situation where maybe by having more data we can reduce the amount of the communication that our algorithm needs because Delta would become a smaller it is so I'm going to
20:13
talk about 1 algorithm are which does have this kind of Finland's depends on an adult and 2 have been followups to Italy since that of brief mention at the end of so I never talk about is called a dangerous for distributed approximately method after for those of you know on the Indian now within the structure is very similar so it's an underrated ,comma than were each time each machine solve some local optimization problem of the following form and then missions communicate to each of the average gradients and the average age a solutions after solving other local transition problem they compute the average beer To end .period algorithm what is being tuition here so the commercial
21:05
property is that this algorithm is essentially equivalent To during an approximate Newton step so what is the Newton stepped on so if the problem is that in here is sufficiently small than we have had since but then 1 of the classical ways to do it a optimization is true it iteratively do something like the following on this has a very fast convergence could quadratic in general but but we can't implement it here because it requires us to compute and invert has since and as we said is actually computing and communicating Parsons is pretty expensive
21:48
MAN now in our is sending an equivalent which riding has seen in his and the average of the local has and what it turns out
21:59
that at least for a combatant functions Dane is equivalent to do steps which are not this for not liking it instead but it of this form so ignoring the UI for now here we have the average of the Heston's inverse worse here we have the average all the inverse the local has since but I should emphasize that the algorithm doesn't explicitly compute this has him so even if the dimensions you and you don't need to store dividing mattresses rather implicitly by solving these local optimization problems and updating this is essentially what you do if you .period realestate related solely to the problem which is how the no because In OK so
22:50
I in this local problem is that Europe has led Congress willing to bend I do have regularization here so I could do things like it's sad gloriously Argentina as the seeing less of the newsletter said that was going to be the father yes but to the conjugate gradients you the number of iterations you need to do it in the scale of the dimensions or with square root of the condition number so you can do that but the number of communication rounds would scale with the condition of don't get this improvement through the window of communication OK so a
23:32
this thing and if they are not the same because we it over the order of the universe and that the average but it .period days that's rare were talking about a situation where these past seasons or similar warranted a Delta related sending if they were exactly the same there would be no difference between this term in this term on the Delta in setting the only if they are different but just a little bit and the difference is are quantified by this said which allows us to argue convergence of guarantee but which is basically
24:08
the following federal so they did that every tour in every iteration of the shrink distance to the optimal by something which depends on their age till the minus 1 in which the wages the actual Hassan based on the minus 1 is the average of the inverse enhancements so it again if all of the local hazards are exactly the same age and is just the interests of the nation the minus 1 in the product would be just fine and then setting I data to be I actually get a convergence with just 3 1 iteration are the realistic setting where there only builds up a related to this thing want to be exactly 0 but rather something that would depend on builds up ends it In London the strong convexity parameter in the overall you get an algorithm or the number of communication rounds is the growth making the required accuracy at end the bends in the square of Delta over London so we've dealt is the smallest London it means that the number of communication rounds is just the the accuracy and independent of any everything else a
25:27
just to give you illustration of this algorithm so this is on synthetic data although we also did some experiments in realworld data Ussery we compared data may be 1 of the most popular algorithms for this problem namely Indian and the left column is for 4 machines right columns 416 machines and the different lines correspond to different amount of data which was randomly partitioned between the machines but so you can clearly see that today our as In intuition gets more and more data on the relations the relatedness of V local functions are becomes a stronger than Delta shrinks and indeed the number of communication rounds that you need here the excesses our communication rounds this as long the optimization of so the number of communication rounds it decreases by contrast stadium and doesn't utilize relate relatedness between the local functions so even if machine gets more and more than the convergence rate remains the same a they
26:37
know they were in the guarantees a talked about adjust for quadratic functions we can also provide some guarantees for longer functions but there are a bit weaker I want to discuss them I'm in In Addis said earlier there had been some followups to this work since so for instance the Jinjiang innings shall have this very nice people last year when they propose different in somewhat more sophisticated algorithm called this called for the same listening as always where they improved their dependence on the ratio between Delta and London we have built overlooking the square they have disclosed of builtup area over longer a so In interviews are very nice algorithms and 1 thing that should be kept in mind about them is that they're still not necessarily very very cheap in terms of time because we still need to solve some local optimization problem at every day and around which in practice is a little bit too expensive so in terms of the amount of communication it's generally small but injuries of runtime error it said it might not be the best in possible are another thing I should point out is that Iran it or so currently the kind of analysis that we have is before quadratic functions in for the algorithm jamming cell they were able to extend it to self concordant losses under some assumptions of something don't use or slightly weaker we still have an algorithm for the related sending which is safe for general area strongly convex and the smooth In losses place 1 where you get a good depends on Delta yes which is little more interested in your house bottles which you can see us from the square just worst of uniform of water possession of abuse was this well it in practice the difference between them is not necessarily that a dramatic I number well I'm talking about this algorithm because that's the algorithm you worked on it gives a simple anyway to take advantage of the Delta in but today and in many cases that haven't been could work better but I prefer to talk about my own work someone else's work if you knew you could get to away we just less than those of us in yet it said it's a good question so the difficulty is that both in our algorithm and their algorithm the it's said something very similar to "quotation mark Newton methods and to do the analysis correctly the over for any kind of a algorithm from the nuclear family it's very difficult to get something satisfactory far assuming something like self concordant are in practice I don't think it's really necessary in terms of practical performance these albums and you can easily run on anything but said the analysis I don't know how to do of them was his mind on the contrary most of the group's back it's nonsmoking might not even have a hacienda before the that means that the those responsible those small compromises that just won't get usually the convergence of decisions on it that's a good question and I don't know actually In practice it does work but I don't know I'm not sure if we know how to analyze it so it is the standard Newton algorithm we don't know how to analyze it in this students it's only more difficult OK moving to the last scenario will discuss it maybe the easiest in some cases where the data is in fact randomly partitioned between the machines but it is a special case of the Delta related they're sending where Delta is something like 1 of the square root of number of data points machine but they can utilize the fact that it's random petition to get even better results than in the delta related setting in so just translate the results of the previous algorithms I try to understand what is the regime where we can get a small number of communication round something which is just logarithmic in the required accuracy that the previous algorithms did that as long as the strong competitive parameter he is at least 1 of the square root of an which is OK in many cases but it is not always because again this slammed the strong convexity parameter often comes from some explicit regularization that we had and offering a indicators with the number of trade data points usually the literature and what you see something between 1 over square with the number of data Point and 1 over the number of data points so this gives us 1 extreme of this regime but also because we do want to use smaller In a regularization and then the but number of communication round is not as good as I am in contrast to the runner petitions in the ongoing discussed is actually much simpler approach it also ensure that the algorithm that does allow you a lot of 1 of rap's communication rounds as long as there is 1 over and increased up to a lot of factors so we get this nice and good behavior in terms of communication for in March broader choice of the regularization parameter in London the other 2 a explain this approach only to take a detour and talk a bit about without replacement sampling forced aggressively after which will return back to the distributed arms so we forget about the distributed setting for now we just want to look but the situation where we have some function which is the average of many individual functions I want to optimize it so it very very popular family of governments and then this is the category methods so if we do something less depressing reading the sentence of gradient method what we basically don't that each time sampled 1 of these In functions and fight out computes the gradient of some gradient and the current points to take a step along that set a direction and project that to the meeting if needed a
33:28
now the way that the standard analysis is when it is assuming that means I keys using is our sampled uniformly at random and independently from 1 to end the course because then each GTE is an unbiased estimate for a gradient off the function actually want Western might want to optimize the W but it there is actually a certain theory practice discrepancy here because in practice quite often it's much better to do sampling not with 1st placement but without replacement so if I already sampled some index I don't sample it again a different way to think about it is that I optics some permutation over the indices and uniformly at random and then just go over the data according it to that order just to random Chapter 4 of them and go over it
34:24
Our then maybe I can have been reshuffled then I go over it again and then not only works said that it is not only is it often works better in practice to get faster convergence but it's also a way off in the March easier and faster to implement because going over data sequential order because of the caching effects or if they're resides only in some it's slower device it's much faster to go sequentially men of random access it and know intuitively that it without replacement sampling works better because I sort of forcing the algorithm to process all the data equally if for example with replacement it only happens on average but determined to be very difficult to analyze these 2 agreeing methods in when we do simply in this way because now the updates are correlated idle and no longer pick having this independently from everything before but there had been a little bit of work in that in this direction so they adore classical results for incremental gradient efforts to basically that convergence bells which work in nor in which order I go over the data but I'm not assuming any kind of friend this year so the Bells are much weaker connections show that a place in some cases they can be expansion exponentially smaller slower than with replacement sampling the is very recently there was a very interesting work by good when abandons the nonpolitical which tried to analyze press gradient descent were strongly comics some the and problems In shown that if you do sufficiently many passes over the date of doing something without replacement then eventually you do get a small error so as carrying gets larger you get in the Christian which is like 1 of the key squared which is 1 of the key in the with replacement however they also have a very strong dependence on the number of the data points so it you just have to make a balanced a nontrivial you need to do at least and passes over the data if you want to be better than with replacement care has to be something like an acute on and this is a little bit unsatisfactory because the in In the case where we want to use the gas ingredient methods to begin with is when we don't want to do many passes over the data if you're willing to do many passes over the data there are actually much better methods to just plain integrating the center accelerated when in the center the fastest way to cast the methods and so in a situation where King is smaller maybe even 1 of these results don't tell us much unfortunately so I know what I'm going to talk about nexus of new results which give an analysis forced category in Athens with it without replacement sampling but our goal is a little bit more modest in the sense that we won't show that the without replacement is a stricken better but this will show that it's not worth in the worst case sent to them considering scenarios like strongly convex and some functions or convex functions we get the same kind of thing convergence rates as in the air with replacements in sampling the case I so we talk about various scenarios in the context longestrunning conflict since morning I'm also an analysis for Northwestern version of the SCO G algorithm which is the 1 that would relate later on to undistributed earnings I solely from a little bit hollow this kind of analysis works basically uses Indians from stochastic optimization but also from adversarial online learning introspective learning theory I want to assure not familiar with these don't worry I'll explain it as we go along it is OK so their simplest maybe to explain is the situation where the functions of just convex sending Lipschitz speechifying and we look at an algorithm which sequentially processes these functions according to some random permutation producing the interests and our goal is to prove that in expectation the average then in In the averages erasable optimality of interest in his order of 1 of the square root of the and that In a sign of base unless you can argue that if you pick In a single wt the teachers uniform at random expectation this will be the boundary can take the average of the W's are but this is what we this is the kind of convergence said about if we would like but and a actually
39:28
talk about a particular algorithm only would require is to have than the algorithm will have it regretted bound in the adversarial online learning sending out which is a situation where are these functions I basically don't assume any kind of a statistical assumptions about the arbitrary functions and then but maybe comics initiates and then I want that the average loss of the points wt that the algorithm produces is only 1 of square 250 worse than the average loss of any fixed .period In NEW enlightening to France since this is if this holds forced to cast a gradient descent but also other algorithms Amy and then the
40:18
perfect can basically showing me to slides I think which will have time to go over every year .period but then the following is the thing that we want abound in I'm using the fact that in this city is a signal this plantations was chosen uniformly at random so in expectation of the marginal segment is just the beginning I I added subtracted terms and it and then I applied the regret that I assume so this allows me to up about this whole thing by 1 we're screwed but this the 2nd term I ain't right a little bit said it differently and the summer this simple
41:06
algebraic manipulations and what they end up with is this city abound where I got what I have here is expected difference between the average loss of WTA all of the losses seen so far minus the average loss on the losses that is still worried according to the head mutation program and I'm going to do something which might appear to be very loose so I upper bounds of the expectation terms by the expectation of the supreme over every possible .period W in mind please so why do I do this I I do this because it this kind of thing what basically I'm asking when trying to balance this expression is I had like I ran partitioned it takes something I run runway petitioned did too it is a group of the minus 1 in a group of the rest and I ask for any given .period w how large can be the difference in the average loss between these 2 thinks so because of concentration of measure in uniform convergence effect this thing can generally be it is small and actually this is exactly the term that has been studied in productive learning theory introductory learning we have some fixed data which is splits through a training set said and many can ask what is the difference between the empirical risk of the training set VS the at risk the average loss on the test set there's been an entire theory developed exactly for these things in particular it can be
42:51
shown in you can up about this expectation of Supreme them by In trends that version of frontiermarket complexity which is used to it and yet here then OK capital and is the number of the total number of data points to end team is the number of iterations the algorithm does so I want to go there and he said yes so here I am assuming that team is less than or equal to and R I do just 1 random shuffle of did not actually everything I do here can be generalized to situation where you do repeated reshuffling the 1st stage of the disease in Yeso actually this tour will dominate this term that's OK because I just want to end up with the Bound which is like 1 of the square root of the year and this year you know it couldn't could actually equal and that's part of the site of the 1st of us yes yes we will be watching all over you out of your way to fall in the yen an analysis that allows me to do that OK so this kind of expectation of supremely Yukon upper bound by a rather at complexity the run local complexity of me in the main W and then you can apply it can instantiated for various cases so maybe the simplest wonders if we talk about in your predictors of bounded moral and the the losses of comics elections we get there 1 was cleared of the rates and actually you can also show that all the all the parameters the alteration normal the predictors and Lipschitz constant on why are all corrected get the same thing as indeed with replacement case up to In constants N and then you can also induce a more sophisticated version of this analysis to get saved 1 of London tea with funds from complexity here you don't need to work harder because the convergence doesn't give you 1 of adherents but it turns out you can do some tricks I to get around that I won't have time to go into the details Our but today start getting back to the distributed sending our focusing on In the results for the SCO G which is known to have a family of algorithms and was wasn't developed over the on past few years including by members on the audience here but which are exactly targeted at solving in optimization problems of the following year form they have cheap stochastic iterations likeliest aggressor gradient descent but their convergence rate is linear epsilon accuracy number innovations only skills the rhythmically with Epsilon and only analysis I'm familiar with are in strong views with replacement sampling of and instead consider a without replacement sampling and particularly as SVR G algorithm in Seoul algorithm the standard with replacement version has the following year it forms a very simple algorithm it goes in a box in each at Poppy computer oneforall gradients of the grid with respect to but function actually want to optimize and then we do at the it's in aristocratic iterations region big 1 individual in losses from the trend them into an object which has this form in expectation is still corresponds to the gradient of but the these trends here ensured that the variance noise that we have in this update gets smaller and smaller in overtime please In and the sentences is that basically had to do a lot of 1 of rap's at books the overall and in each had blocked a number of considerations incidentally staying 1 London Beijing now a N In a recent paper in adjacent leaning in Miami this very nice observation that actually can take this algorithm and apply basically as it is for distributed learning so Distributed Learning the individual functions in a fire In distributed across different machines but still the machines can simulate this algorithm selection window communication round to compute the full gradient and then they each machine runs the means to achieved the directions on a subset of their data now there is that it is difficult to hear because the other than requires with replacement sampling and we were talking about a situation where the video was a petition at random so that they correspond to sampling with replacement that at least as long as the number of iterations was over it said this should be a square root of defense as long as you sample left the square root of the total number of examples by the birthday paradox with and without replacement sampling is more or less the same sold this thing would work I am so when I take this constraint and plug it into the analysis it means that we get this way algorithm for distributed learning optimization however is only applicable when there is a strong convexity parameters at least 1 Mosquero defense which as we said earlier it is it's quite a bit restricted so what this sector can do is simply do the same thing in the same house them but this time used without replacement sampling which fits much more with the random partitioned they know that we deal with so it's exactly the same as in the previous slide but instead of each time picking it in individual losses in independently the effects of permutation over there and what do the update according to this permutation it's no this is something that I can actually simulate with random petition not all the way up to I order of an order of the number of data points in 18 here again maybe up to a lot of factors but it is the view is that this is the 1st of yes but the point here is that I have used it expensive Britain calculations which require full passenger data but the number of stochastic iterations and into it is actually less than they might decide data so that's the answer to that of the disqualification of us is those that ultimately this is a really smart yet so and I'm going to talk about the analysis said in a moment but it's very important that if you do several passes over the data you reshuffled the deck each time otherwise the announces doesn't work but I think this was also it was noticed that with these algorithms if you don't see every person you teach time at can maybe the converge per year or not at all N look at the basic
50:29
endowment unclear OK so this is the number that you can apply on anything but what can we say in terms of rigorous guarantees so
50:39
grandly we can give abound .period replacement as you're G R but only for a regularized squares but this is for technical reasons as far as I can discern that this is what we can currently do still it's an important setting so again using the same kind of parameter choices long 1 of Robson at box and 1 over lined up without replacements the caustic iterations will get in a similar kind of convergence rate as the with for placement a case so what this means in the context of distributed In optimization is that it at least for regularize Least squares we can do it we can find In optimal solution a took
51:24
no action to implications here so far most riveted optimization which means that if we just want to do it some replacing version of this we actually don't need to do any data reshuffling all the way up to land being around 1 over the years data I which again is good in situations where access to the data is expensive and doing this reshuffling it's not something you wanted to many times In the context of distributed optimization again where land is at least 1 of them size you get an epsilon optimal solution friendly petitioned and you only need to do a logarithmic number of communication rounds and up because of the structure of the algorithm the is actually dominated by it this
52:11
fall gradient computation which is fully paralyzed because each machine can just computed only in gradient on its socalled than I normally do in averaging at end so you can
52:23
also choose a from time you get a runtime speed up by using more missions yes so you Nunavut officials was you you shut up there was more it's just me and then you never changes traditionally young investment in at once you don't need to pass over the dinner several times but you don't need to change the order sequential passes this is the 1st in the history of cellular Iowa comes from house and I probably the slowdown of the horses and father so they too much much smaller steps and also I actually in the point is that here when it's saying we don't have data shuffling it's because the number of stochastic iterations you do not larger than data sites you do more than 1 thousand because you need major about to compute the full gradient but to do this the Catholic iterations you don't touch it .period more than once and that is important otherwise it'll have to get this country the works some of the changes should walk so in this case like to reconcile the fact that I am working with you don't change of relations news most lost sight of the values in this also there is no legal Newfoundland you have to do things that but you want to that stuff of class and also that's a good point and this is also an analysis for a smear G and don't have announced this at the moment for solid work if the CIA and the structure of that of these algorithms is also different so here I merely utilizing the fact that this algorithm uses for Britain computations in a small number of the custard iterations side as the saying to do if you don't have the exact Reagan computations that they have more stochastic iterations and so that does break from it In the analysis here because it does require you in the stochastic phase to touch admitted .period more than once yes sir if you don't and lost the use of the Sunday at once this year devastating years necessary loser you see them in the former Reagan computations "quotation mark questions OK so I think this is
55:11
this is more or less at end so to summarize it talked about 3 scenarios in Distributed Learning optimization our and gave all kinds of results in each of them so many of the 1 that we currently I think understand best actress in terms of worstcase guarantees is the arbitrary petitioner case with where it can be disappointingly the simplest baseline is also a worstcase optimal In another related settings the situation is that we can handle quadratic functions pretty well maybe self concordant losses but we currently don't have any In the algorithms for at least of political guarantees for generic are strongly convex and the functions and also the kind of algorithms that we have already had the need to actually solving optimization problem at every iteration and finally the run petition currently we have probable guarantees from the squares the algorithm I presented as recently as the energy I think I still haven't experimented on it too much but I suspect it would work generally on strongly convicts and function but we don't have an analysis for it at the moment I well OK actually it it is possible to get some kind of analysis but again it's only a situation where London is a very large is so large that there is action on the difference between with without replacement set in the Dr. replacement sampling so it's not a very interesting results so I think I'll stop you thank you very much at the plant