Generalization properties of multiple passes stochastic gradient method


Formal Metadata

Title
Generalization properties of multiple passes stochastic gradient method
Part
10
Number of Parts
10
License
CC Attribution 3.0 Unported:
You may use, adapt, copy, distribute, and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The stochastic gradient method has become an algorithm of choice in machine learning because of its simplicity and small computational cost, especially when dealing with big data sets. Despite its widespread use, the generalization properties of the variants of the stochastic gradient method used in practice are relatively little understood. Most previous works consider the generalization properties of SGM with only one pass over the data, while in practice multiple passes are usually considered. The effect of multiple passes has been studied extensively for the optimization of an empirical objective, but its role for generalization is less clear. In this talk, we start filling this gap by studying the generalization properties of the multiple passes stochastic gradient method for least squares regression in an abstract non-parametric setting. We show that, if all other parameters are fixed a priori, the number of passes over the data indeed acts as a regularization parameter. The obtained bounds are sharp and match those obtained with other regularized techniques such as ridge regression.
Transcript: English (automatically generated)
So thank you for the introduction and the invitation. So okay, we have learned
during this workshop that, to deal with large-scale learning problems, statistical aspects and optimization aspects have to be treated jointly. One example of this is the stochastic gradient method,
which today is an algorithm of choice for this kind of problem. So our work started from the observation that there are several results studying generalization properties of one-pass stochastic gradient descent.
But in practice, very often, other heuristics are used, such as multiple passes over the data, usually combined with early stopping. So what I'm going to present today are some theoretical results about this kind of heuristic,
and in particular the generalization properties of such variants of SGD. I will use the terms stochastic gradient method and stochastic gradient descent more or less interchangeably, so SGM and SGD. So I start from the problem setting. We consider linear regression in
an abstract Hilbert space with the square loss. The choice of the square loss is critical in our analysis: all that I'm going to say heavily depends on this choice. I will probably forget to say this again during the talk,
but the square loss plays a crucial role. So the objective in learning, or in stochastic optimization, is to minimize a functional E, which can be written as an expectation. In this case, again, the expectation of the square loss, which in our setting depends linearly on w.
So the problem here is the fact that the measure rho is in general unknown. We can only access a finite number of points, called the examples, and so I have access only to
these points, which are independent and identically distributed, randomly sampled from rho. This setting naturally includes linear regression with Gaussian noise in R^d. So if I fix H to be R^d, I can generate data, for instance,
by randomly sampling points x_i in R^d and then taking linear measurements corrupted by Gaussian noise delta_i. But this framework also includes, for instance, functional regression, if my input points belong to an infinite-dimensional space, for instance if they are curves or functions.
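In symbols, the setting just described is (a compact restatement; the notation is meant to match the slides but is a reconstruction):

\[
\mathcal E(w) \;=\; \int_{H\times\mathbb R} \big(\langle w, x\rangle - y\big)^2 \, d\rho(x,y), \qquad w \in H,
\]

to be minimized given only n i.i.d. samples $(x_1,y_1),\dots,(x_n,y_n)\sim\rho$; for example, in $H=\mathbb R^d$ one can take $y_i = \langle w_*, x_i\rangle + \delta_i$ with Gaussian noise $\delta_i$.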
Also, let's say, it is an equivalent way to write the learning problem in reproducing kernel Hilbert spaces. So if you consider learning in an RKHS,
then what you want to do is to minimize, again, this risk, where this time the function w that I want to find, the element of H_K, is a function and depends nonlinearly on the input variable. But, as I think most of you know,
if w is a function in the reproducing kernel Hilbert space generated by a kernel K, then we have the reproducing property. Therefore, w evaluated at a point can be written as a scalar product between two elements of our Hilbert space H. Okay. So Xi here is like the x from before?
Exactly. So Xi denotes the inputs; you will see in a moment why I don't call it X. This Xi can be seen as the input space, R is the output space, okay? And w is a function mapping the input space into the output space.
So what I can do is to consider the map that identifies an input point with the kernel evaluated at that point, and leaves y unchanged. Then, if I consider this change of variables inside the integral functional, what I can write is the following: I first use the kernel trick,
so w evaluated at the input point is the scalar product between w and the kernel section at that point, and then I use the change of variables, and I am back to the model I presented at the beginning.
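In symbols, the identity being described (a reconstruction, with Xi the input space and $K_\xi$ the kernel section at $\xi$; the exact notation on the slide may differ):

\[
w(\xi) = \langle w, K_\xi \rangle_{\mathcal H_K},
\qquad
\int_{\Xi\times\mathbb R} \big(w(\xi)-y\big)^2\, d\rho(\xi,y)
= \int_{H\times\mathbb R} \big(\langle w, x\rangle - y\big)^2\, d\tilde\rho(x,y),
\]

where $\tilde\rho$ is the image of $\rho$ under the map $(\xi,y)\mapsto (K_\xi,y)$, so one is back to the linear model of the beginning.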
Okay. So in this setting, a classical approach is Tikhonov regularization. What I want to do in the rest of my talk is to start from some very classical and basic results about Tikhonov regularization for learning problems; then I will state some assumptions and the main results,
and then I will compare them with the ones that we get for stochastic gradient descent. Okay. So what is the Tikhonov estimator, the regularized empirical risk estimator? It's a point in my space H that can be built in this way. I first replace the risk,
which is unknown, by its empirical version, and then I add to this risk a regularization functional R, which in this talk will be the square of the norm. Then I
minimize this regularized empirical risk and obtain the estimator w hat lambda, which depends on a parameter, the regularization parameter, that balances the weight of the regularizer against the data-fitting term. The properties of this estimator have been studied a lot, but mainly from a statistical point of view.
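As a concrete illustration (not the speaker's code), here is a minimal sketch of the Tikhonov (ridge) estimator in the finite-dimensional case H = R^d, where the regularized empirical risk has a closed-form minimizer:

    import numpy as np

    def tikhonov_estimator(X, y, lam):
        """Minimize (1/n) * ||X w - y||^2 + lam * ||w||^2 over w in R^d."""
        n, d = X.shape
        # Setting the gradient to zero gives the regularized normal equations.
        return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

For instance, tikhonov_estimator(X, y, 0.1) returns the estimator for lambda equal to 0.1.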
So now I want to present to you some statistical properties of this estimator. In order to do that, I need to introduce some assumptions. The first one is a fairly standard assumption:
it basically requires that my measure rho has bounded support, so the norm of my points x is bounded almost surely, and the same holds for the output y. This assumption can be relaxed for what I'm going to say,
but I will consider this simpler setting. Another assumption that I make in this talk is the fact that the risk has a minimizer. Since the set of minimizers is then non-empty, I can consider the minimal norm solution.
This is a less standard assumption; it is a particular case in learning that I will call the attainable case, where you have exactly one target parameter to estimate. Under these assumptions, you can already prove some consistency results for the Tikhonov estimator,
but more assumptions are needed if we want more precise, non-asymptotic bounds for the error. So the existence assumption here is just because you are perhaps in infinite dimension, right? In finite dimension, if everything is bounded, you will always have a solution, right?
In infinite dimension or in finite dimension? So yes, in finite dimension bounded sets are compact, and the risk is a convex continuous functional, so you have a minimizer. And when you use the infinite dimension
elsewhere in the talk, can we always translate what you say to the finite-dimensional case, where w is just a vector? Yes, you can. Yeah, yes. Okay. So assuming existence in infinite dimension is a bit like allowing the norm of w star to be large.
Yeah, it can be, yes. So large that it is infinite? It is finite, but maybe you don't have an estimate for it, for instance. No, sorry, my comment was the other one. My comment was that if you don't assume that w star is in the set, it's like having w star being
very large in a finite-dimensional space. Here, you assume that w star exists and its norm is finite, relatively small. I see. It is a standard assumption. So I'm presenting the results in this setting. Some of the results that I will present hold even without this assumption,
but in this talk I will assume that it holds, mainly because I want to introduce the source condition, which is the technical condition that allows us to get these error bounds. So first I define T, the second moment operator. It is a linear operator from H to H,
and it is well defined and bounded thanks to the boundedness assumption that I just made. Okay. What is this source condition? It can be read as a regularity assumption on the solution that I am assuming to exist.
More precisely, it asks that the minimal norm solution belongs not only to the range of T, but to the range of T raised to a certain power. Okay. I will use this slightly strange indexing just to
make the comparison with existing results easier, because this is the usual scaling used in the available papers. Okay. What does it mean? With a picture: we have the space H; if r is equal to one half,
I'm not asking anything, since T to the zero is the identity, so I'm just asking existence of the minimizer. If r is bigger than one half, then these spaces, which are vector subspaces of H, are nested, and as r increases my condition gets stronger.
Okay. So as r increases, I'm requiring more regularity of my solution. Why do I call it regularity? Here is another interpretation. The operator T is a compact self-adjoint operator from H to H.
Therefore, I can diagonalize it and build a sequence of eigenvectors and eigenvalues. Okay. My point w dagger belongs to H, this is a basis of H, let's say, and so this quantity is finite; this is just equivalent to the existence of w dagger.
What does the source condition mean? It asks that, since w dagger is in the range of T to this power, there is an h such that w dagger is the image of h under this operator, so h belongs to the big space H.
And here what I can do is compute this quantity using the basis of H, and I get this condition: since h has a finite norm, this sum here is finite. What does it mean? Remember that these eigenvalues go to zero, since we have a compact operator.
Okay. So this inequality asks that these coefficients, these Fourier coefficients of w dagger, go to zero sufficiently fast, faster than what is required for mere existence. So this is the usual assumption, and this is the stronger one.
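Written out (a hedged reconstruction, assuming the indexing in which r = 1/2 asks nothing beyond existence, as described above):

\[
T = \int_H x\otimes x \, d\rho_X(x), \qquad T u_j = \sigma_j u_j,\ \ \sigma_j \downarrow 0,
\]
\[
w^\dagger \in \operatorname{Range}\!\big(T^{\,r-1/2}\big)
\;\Longleftrightarrow\;
w^\dagger = T^{\,r-1/2} h \ \text{ for some } h\in H
\;\Longleftrightarrow\;
\sum_j \frac{\langle w^\dagger, u_j\rangle^2}{\sigma_j^{\,2r-1}} < \infty .
\]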
Okay. So with this condition, so boundedness plus source condition, we can prove the following error bounds on the Tikhonov regularized estimator. So the key point here is the choice of the regularization parameter.
As you see, I can prove that I have some bounds if I choose the regularization parameter as dependent on n in this way. So in a way that depends on this source condition. What I get here is that if R is smaller than three-halves, then I have this rate that increases as R increases.
But then at a certain point, this rate doesn't increase anymore and it stops and is minus one-half. So if R is bigger than three-halves, even if I'm requiring a stronger regularity on my solution, I will not be able to approximate it in a faster way.
Okay. So what's the idea? The idea of the proof is a bias-variance trade-off. What I do here is to introduce an auxiliary point, w lambda, which is the minimizer of the true risk plus regularization. I will not have access to this point, but I don't need to compute it.
I just use it as a reference point. Then I can bound the norm of the difference between my estimator and my true parameter w dagger by the sum of these two terms. And as you can see here, we have a term that depends on the sample
and a term that depends only on the choice of lambda, so on the regularization properties of my approach. Balancing the two terms, since the behavior of this term is determined by the source condition, we get the result.
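In formulas, the decomposition just described is (a sketch):

\[
\big\|\hat w_\lambda - w^\dagger\big\|
\;\le\;
\underbrace{\big\|\hat w_\lambda - w_\lambda\big\|}_{\text{depends on the sample}}
+
\underbrace{\big\|w_\lambda - w^\dagger\big\|}_{\text{depends only on }\lambda},
\qquad
w_\lambda = \operatorname*{arg\,min}_{w\in H}\ \mathcal E(w) + \lambda\|w\|^2 .
\]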
So these results, the results that I showed are minimax. In this sense, if you fix a class of probability measures, and in this case, the class of probability measures that I fix are the ones that satisfy the assumptions I just introduced.
Then, okay, I will in general have a different solution for each probability measure. And if I fix an algorithm, I can compute the distance between my estimator and my true parameter. Then I can do worst case analysis
in the sense that I put in front of it a maximum. Therefore, I take the problem on which my algorithm behaves worst. And then among all the algorithms that I have at my disposal, I take the best one. So I take the algorithm which has the best worst behavior.
Okay, and as you can see here (I don't know if you remember the exponent in the bounds for Tikhonov regularization), the exponent of n is the same. So is there a reason why you're choosing to do this in the Hilbert norm rather than the L2 norm? Okay, yes.
The reason is the following. The results that we obtained for multiple passes SGD are minimax optimal in the H norm, but we still, let's say, don't have optimal results for the error. That's why I'm presenting these results. But basically, you can do the same, replacing
the norm with the error and the r minus one half with two r, and you get the same picture. Okay, so what happens, as I told you, for r bigger than three halves is that, even if you are asking more of your solution, you don't see this extra regularity.
This is called saturation: Tikhonov regularization has a saturation effect. And the problem in practice is that you do not know r, so you do not know how to choose the regularization parameter, and usually you need some adaptive procedure.
In this case, since we're dealing with the H norm, you can use, for instance, Lepski's method, also known as the balancing principle, which allows you to recover these optimal rates. Okay, so finally, if you make some
further assumptions on your operator T, more precisely that the eigenvalues decay at a certain rate, then you can improve the rates; these are called capacity-dependent rates. They depend on this decay of the eigenvalues and are more precise, let's say more tailored to the properties of your operator T.
So in the talk, I will consider the capacity-independent setting, okay. But there is a but: all the analysis that I showed you does not take into account the optimization error. So, okay, what can we do?
Someone did it already. In the paper by Bottou and Bousquet, what they suggest is that, if we are in the large-scale scenario, computing this estimator w hat lambda can be costly. This estimator is the solution of a minimization
problem and does not come automatically out of your computer; you will need to approximate it in order to compute it. And so what makes more sense in this scenario, since this will be a bottleneck, is to analyze the behavior of, let's say, an approximation of the true minimizer, okay.
I will denote it by w hat lambda t, meaning that w hat lambda t is the outcome of t iterations of an iterative procedure applied to the regularized empirical risk. So, to the classical bias-variance decomposition, we add another term, which is an optimization term.
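The decomposition being described can be sketched as follows (the talk shows it on a slide; this is a reconstruction):

\[
\big\|\hat w_{\lambda,t} - w^\dagger\big\|
\;\le\;
\underbrace{\big\|\hat w_{\lambda,t} - \hat w_\lambda\big\|}_{\text{optimization error}}
+
\underbrace{\big\|\hat w_\lambda - w_\lambda\big\|}_{\text{sample error}}
+
\underbrace{\big\|w_\lambda - w^\dagger\big\|}_{\text{approximation error}} .
\]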
So the first obvious comment or consideration that one can make, looking at this decomposition, is that when we use an optimization procedure to optimize the regularized empirical risk, it makes no sense to go beyond the statistical accuracy,
since the additional effort will be lost. Okay. So, parallel to this observation, let's say,
in the last years there has been very active research on optimization methods to solve problems that have exactly this form, or that can be interpreted as a regularized empirical risk functional: problems that have some structure and possibly are strongly convex or not,
especially in the large-scale scenario. There has been a huge amount of work on methods that scale to these dimensions, so first-order methods, and that in a sense are not blind with respect to the structure of the function we optimize.
So they take into account that we are minimizing a sum. I'm thinking of first-order methods, but also splitting, stochastic, incremental, aggregated methods: all these classes of methods. So what we can do, once we have the convergence
properties, the complexity bounds, of these optimization methods, is to balance these two terms and obtain some new trade-offs that also include the number of steps needed to approximate my true parameter. What we can also do, and it is what I will do now,
is to take another direction and abandon this decomposition, trying to have one that does not split, let's say, the statistical part from the optimization one. Okay, so the stochastic gradient method is maybe the algorithm
that does this job. The idea is that we forget the empirical risk and we try to directly minimize our true risk, the expected error. With what? With a stochastic gradient descent method,
since we do not have access to the true gradient, our measure being unknown. One possibility is to use this simple version, where each step is determined by a stochastic approximation of the gradient given by a single sample point.
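For the square loss, a plausible explicit form of the step being described (a reconstruction, not necessarily the exact slide) is

\[
\hat w_{i+1} = \hat w_i - \gamma_i\,\big(\langle \hat w_i, x_i\rangle - y_i\big)\, x_i ,
\qquad i = 1,\dots,n,
\]

so each update uses the gradient of the loss on a single, fresh sample point.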
And so in this setting there is no explicit regularization; we do not have a lambda appearing explicitly. We also have several theoretical results, under various assumptions, that establish convergence of this iteration
towards the minimizer, or the minimum, of the true risk. But what's the point here? The point I want to make is that, in practice, multiple passes over the data are very often done. I am, of course, considering a situation in which I know how many data points I have. So it's not a truly online situation,
because I know the horizon: I know how many data points I will see, and I have the possibility to visit them more than once. But this is what is done in practice, and so our idea was to try to explain from a theoretical point of view why this works. Okay, so we focused on the simplest version
of multiple passes stochastic gradient, which is a cyclic version. I rewrite the iteration in order to make some points, as in the sketch below. So, as you can see here, if you look at multiple passes stochastic gradient,
it can be immediately interpreted as a minimization procedure for the empirical risk. This is well known: if we visit each point infinitely many times, we will converge to the minimum of the empirical risk. The other observation that can be made is the fact
that each step is a stochastic gradient step, as before. And the reason why I use this decomposition with an inner loop and an outer one is that I wanted to put in evidence this number t,
which is the number of passes over the data, and that is usually called an epoch, okay?
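A hedged sketch of the cyclic multiple-pass iteration with the inner/outer loop structure just described (the step size gamma over n is the one appearing later in the talk):

\[
\text{for } t = 1,2,\dots \ \text{(epochs)}: \quad
\text{for } i = 1,\dots,n: \quad
w_{t,i+1} = w_{t,i} - \tfrac{\gamma}{n}\,\big(\langle w_{t,i}, x_i\rangle - y_i\big)\, x_i ,
\qquad
w_{t+1,1} = w_{t,n+1} .
\]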
So the main question here, which I tried to answer, is: how many passes should we do? How many times should we visit our data in order to minimize the true risk? So what's the point? As I told you, the convergence of this iteration towards the minimum of the empirical risk is well
established, starting from Bertsekas, but also before, by Gaivoronski and Polyak. There are many results; let's say it's classical. But what we would like is to prove this other kind of result. So what's the idea? The idea is to exploit stability properties
of our stochastic gradient, combined with early stopping. How? We first introduce an auxiliary iteration, which is gradient descent applied to the true risk, w_t.
As you can see here, I fix the step size to gamma over n, where gamma is a constant and n is the number of points, and I write the gradient descent in this slightly strange way in order to make the comparison with the incremental method easier. So w_t, if you want: w_{t+1} will
be the result of t+1, let's say n(t+1), iterations of gradient descent on the true risk. So what happens? We know that this is gradient descent applied to the risk, so this iteration will converge to the minimizer of the risk.
On the other hand, I have my multiple passes stochastic gradient applied to the empirical risk. And I will prove that, for a certain time, these two iterations are close to each other; but then, after a while, my iteration deviates from my goal and goes
towards the minimizer of the empirical risk. So the job here is to detect, let's say, the zone where the approximation of my true objective holds. Okay? I'm a bit confused here, because if you're using
a fixed step size and you're not averaging, then you're not converging. Okay, yes, it's true, here I cheated a little bit. But note that this one here is converging, do you agree? No. Well, this is the true gradient applied to a convex functional,
with a fixed step size. And it's fixed. Oh, there was no x_i there, I see. Okay, okay, sorry. Instead, for this one it's not quite true,
let's say. It's true in function values, for sure, but here we do not assume strong convexity. So that's a subtle point, but I don't think it is essential for the presentation. Okay? So before giving the theoretical results,
let's see what happens in practice. What we did here is a very simple simulation. It's a linear regression problem: we generated some random points uniformly in R^d and some noisy measurements y_i, and then we divided these points into a training and a test set,
and we applied this incremental stochastic gradient. So what happens is that if you look at the training error, it is more or less decreasing; we know that, since it's an incremental method, it will not be a descent method, but it converges towards the minimum.
While if you monitor the error on the test set, which should be a good approximation of our true, expected error, what happens is that after some iterations these curves start to go up again.
Therefore we enter, let's say, an overfitting region, right? And we had better early stop our iterations here instead of going until the end. How many points here? So yes, I don't know, it's an old figure. I think it's around 30,
50, less than 100. And this is the number of iterations. I think the right answer is 30 points, and this is the true number of iterations, so here is really the first epoch. The first epoch is very effective here, and then the error starts decreasing a little more slowly.
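A minimal sketch in Python of the kind of simulation just described (illustrative only: the dimensions, noise level, step size, and number of epochs below are made-up values, not the ones used for the figure):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 30, 10, 0.5                            # made-up illustrative values
    w_star = rng.normal(size=d)
    X = rng.uniform(-1.0, 1.0, size=(2 * n, d))          # points sampled uniformly in R^d
    y = X @ w_star + sigma * rng.normal(size=2 * n)      # noisy linear measurements
    X_tr, y_tr, X_te, y_te = X[:n], y[:n], X[n:], y[n:]  # training / test split

    gamma = 1.0                                 # constant; each step uses gamma / n
    w = np.zeros(d)
    best_test, best_epoch = np.inf, 0
    for epoch in range(1, 201):                 # multiple passes over the data
        for i in range(n):                      # one cyclic (incremental) pass
            w -= (gamma / n) * (X_tr[i] @ w - y_tr[i]) * X_tr[i]
        train_err = np.mean((X_tr @ w - y_tr) ** 2)
        test_err = np.mean((X_te @ w - y_te) ** 2)       # proxy for the expected error
        if test_err < best_test:
            best_test, best_epoch = test_err, epoch      # early stopping: remember the best epoch
    print(best_epoch, best_test)   # beyond this epoch the test error goes up again (overfitting)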
And do you know what function you are using, like what r is, for this particular simulation? Sorry? What function, or what r? So it's a linear function, a linear regression, and I don't know what d is; it was really just a plot to show the behavior.
But it's a linear function in some R^d, so the x_i are in R^d and the y_i are in R. So here there is another example, another point of view, another perspective on a similar example, not the same one. This one I know better because it's more recent.
Here we try to approximate, again in a linear regression problem, a function that can be written as a linear combination of trigonometric functions. So here we have 40 trigonometric functions and, if I'm not wrong, something like 30 or 40 points.
Okay? So this is the result of what happens after the first epoch of incremental gradient. And that's what happened after the 10th. And that's what happens after the 100th epoch. So that's the idea. The idea is to stop the iteration in the middle in order to achieve, let's say, a reasonable approximation,
so a regularized approximation of our points. Okay. So the result: if we assume only boundedness and existence of a solution, without assuming
any source condition, what we can prove is the following. Fix the step size; as you can see here, the step size depends on our boundedness constant, a constant that is unknown in general. Then we can prove that this estimator is universally consistent,
so it converges almost surely to the true solution, if we assume two things: first, the number of epochs goes to infinity as the number of points grows; and second, it does not go to infinity too fast. So this can be interpreted as follows.
The comments that we can make on this statement are three. This step size is fixed, okay, it is a constant, but I recall that it is divided by n in the implementation. And so what this says is that, on one hand,
early stopping is needed to achieve consistency, since the number of epochs should not go to infinity too fast; this can be interpreted as an early stopping rule. But on the other hand, we need multiple passes. So what these results say is that, with our analysis, with this choice of step size, you need multiple passes.
But with a single pass you would basically just have the exact same path; is it really needed? Yes, it's needed, with this choice of step size. Okay? So yes, just a brief comment on this.
What we proved is that if you take a very small step size, gamma over n, and you do multiple passes, where the number of passes is of the order of this quantity here, then you have consistency. On the other hand, what you can prove is that if you do the one-pass stochastic gradient method
and you choose a larger step size, with just one epoch, then you achieve the same result. Here, again, there is the same imprecision that I mentioned before: I'm stating this for the iterates, and for the w's this is not known, I think.
So the idea is this one: if we care about the error, the comparison is this one. So why do multiple passes make sense? Why should I use a smaller step size and visit the data more than once instead of doing only one pass with a larger step size? I hope I will convince you that this makes sense
in the following slides. So that's where the source condition comes into play. If we want finite sample bounds, we need to assume a source condition, and in this case we can still prove convergence
with high probability, with this rate here, this exponent here, provided the number of passes over the data depends on the regularity r of the solution in the source condition.
So, some comments again: what does it tell us? First, the rates are minimax, so they are optimal, the same that I showed at the beginning for the Tikhonov method. Good news: there is no saturation with this method.
Even as r goes to infinity, this rate keeps improving; it increases up to the square root of n. And the stopping rule: what happens
is that the only thing that I have to tune, where I need to be adaptive, is the stopping rule, therefore the number of passes over the data. Also here, as in the Tikhonov case, I can get adaptive results by using, for instance, again, a balancing principle.
So now I'd like to do a second round of comparison with one-pass stochastic gradient. In order to compare with the existing results that are, let's say, in the same setting as ours,
so in an infinite-dimensional setting, taking into account the source condition, I add this additional regularization term in order to compare with all of them. Okay, so the theory: as I said at the beginning, stochastic gradient has a long history. Convergence in the finite-dimensional case
for strongly convex functions is classical and goes back to Robbins and Monro, and there have been many other developments later. In the infinite-dimensional case instead, and specifically in reproducing kernel Hilbert spaces with the square loss, I would say
that the analysis starts from this paper by Smale and Yao in 2006. There are three papers with which I think it's useful to compare our results: this paper by Ying and Pontil,
this one by Tarres and Yao, and this one by Francis Bach and Aymeric Dieuleveut. So okay, first I say things that are, not less relevant, but less relevant for what I'm going to say later. First, this is the only paper
which is able to obtain capacity-dependent bounds, the same that hold for Tikhonov regularization, so it is able to take into account the effective dimension of the problem and to interpolate between the finite- and infinite-dimensional cases. The other two instead give capacity-independent
rates, as ours. And as you can see, there is a difference: we have some methods which have saturation, for instance this one and this one, and this one by Ying and Pontil that does not saturate. And there are different results, in expectation
and with high probability. The only one with high probability bounds is this one, as ours are; and usually the analysis needed to obtain high probability bounds is more involved. Now I want to make the only comment that I think is really relevant here. As you can see, and this is true, by the way, for all these papers,
they all obtain optimal, minimax rates, so all the methods that I presented are indistinguishable from a statistical point of view. The point here is that, in order to achieve these rates, in all these papers
the step size depends on the source condition, on the parameter r in the source condition: here we have only gamma, also here, and here we have both gamma and lambda. So the new picture is the following.
Is it also the case in the third paper, the one by...? Yes, also there the gamma depends on the source condition. Okay, so here we have the new picture, let's say. Basically, we have two regimes in which we can achieve minimax rates. One is the one where you take a small step size, universal,
so not dependent on the source condition, and a number of passes over the data that depends on your source condition. The other one is the case where you do only one pass, with a bigger step size, possibly with averaging.
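In compact form, the two regimes just described (as I read them from the talk):

\[
\text{(multiple passes)}\quad \gamma_n = \tfrac{\gamma}{n}\ \text{independent of } r,\ \ t_*(n)\ \text{tuned, depending on } r;
\qquad
\text{(one pass)}\quad t = 1,\ \ \gamma_n\ \text{depending on } r\ \text{(possibly with averaging)}.
\]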
So I think that in both cases what I want to say is that model selection is needed: in order to achieve minimax rates, here you have to select the right step size, and here you have to select the right number of passes over the data. From this point of view, what I'm claiming
is that our method comes as a natural approach, in the sense that what you can do, for instance, is to compute your training error, keep a test set, monitor online the behavior of the test error, and simply stop when overfitting happens.
So the main difference is who plays the role of the regularization parameter: here it is the step size, here it is the number of passes. And what I'm saying, and I repeat, is that being adaptive in the number of passes over the data
is natural and can be done. Is it too far a leap to say that all that really matters is that the sum of the step sizes is kind of the same, whatever approach you use?
Yes. So here all the analysis that I have in mind is with a constant step size, let's say a constant divided by n. But yes: for the approximation error, which is the optimization error here, what really matters is the sum of the step sizes, and the same, I believe, certainly holds for the sample error.
So if you use different step sizes, as long as the sum stays roughly the same, you'll get optimal rates? So the point is that from our analysis you can derive some convergence for this case, for t star equal to 1, if you take gamma depending on the source condition. But the point is that we do not get optimal rates,
and the step sizes that are derived from our analysis are too small. So an open problem, a future direction of research, would be to understand whether there is an analysis that allows one to interpolate between the large step size, one-pass
regime and the small step size, multiple-passes regime. Yeah, because here, to come back to the triangle of computation, statistics and so on: with multiple passes, if you actually count the gradient cost as expensive, it's much more.
Yeah, it's true. It depends what you have to do here in order to select the right step size: probably you would have to do multiple passes, maybe on less data, just to adjust the step size, I don't know. Why not use a constant step size?
I'm sure you can derive similar results with gamma_t equal to a constant gamma, and with early stopping, which should then be much earlier. So a constant gamma: in our case, at the moment, we are not able to do it. Yeah, it would be great. A single pass with a fixed gamma is always convergent, but it will be very slow.
With multiple passes you must reduce the variance, maybe? No, I don't know. You need smaller step sizes; a constant one is too big. Yeah, because as you can see, if you look at the sample error, so the stability, it increases with gamma.
Therefore, I don't know. No, but I think that you need like a smaller step size. So you can only get smaller and do multiple passes. It cannot be bigger if we need to do less than a single pass. Exactly.
Yes, it's exactly like this. So we can prove, from our analysis, that you have convergence for t star of n equal to one if your gamma is a gamma of n that increases with n. I don't know if this answers the question. So just one last comparison:
there is also gradient descent. We know that gradient descent on the empirical risk with early stopping works, and what I can say is that, both from the computational and from the statistical point of view, there is apparently no difference between multiple passes stochastic gradient
and the full gradient. So this is one point to be understood, and I have to say that it is one point to be understood also in the optimization scenario in this example. You mean the same comparison? No, no, because gradient descent is more gracious, no? Well, you can prove that the stopping time is the same:
if you do the mapping one epoch equals one pass, it is exactly the same, same stopping time, same step size. But you can use a constant step size for batch-mode problems.
Yeah, but it should be smaller: in the Lipschitz constant of the empirical risk you have the 1 over n appearing. So, if I understand correctly, what you're missing is the capacity dependence? Yes, for sure; at the moment we don't have that. It seems feasible if you combine what you did, but it might be tricky to combine the two.
Could be nice, yeah. So on this, as I was saying, this is an open problem also in optimization: it's not clear what the advantage is of an incremental gradient method applied to a sum of parabolas compared with a full gradient method. But what is known, empirically observed at least,
and, let's say, almost theoretically justified, is that when you are far from the minimizer, at the beginning of the iteration, one pass of the incremental method is more or less equivalent to one full gradient pass, until you enter the confusion region where
you're close to the minimizer but the minimizers of your summands are away and push you away from the true one. So this, again, could be one reason why, empirically, the incremental gradient has a better performance than the full one. Okay, so maybe just some idea of the proof.
So, just to see a new trade-off: I recall that I am comparing my iteration with the true gradient one. The bias-variance trade-off this time is obtained by bounding this quantity here
by the difference between our multiple passes stochastic gradient iteration and the gradient iteration on the true risk. So this quantity here, the bias, is now exactly the optimization error, right? Because I am applying the gradient method
to minimize a function, this is exactly the optimization error. And here, instead, I have the sampling that comes into play. As I showed you before with the picture, the idea will be to prove that this quantity is increasing with t and decreasing with n,
while this quantity, of course, is decreasing with t. Therefore, the idea of the proof is to balance these two terms. Okay, so I prepared just a few slides to give an idea of the proof.
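The new decomposition just described, written out (a sketch; w_t is the auxiliary gradient descent iteration on the true risk):

\[
\big\|\hat w_t - w^\dagger\big\|
\;\le\;
\underbrace{\big\|\hat w_t - w_t\big\|}_{\text{sample error: increasing in } t,\ \text{decreasing in } n}
+
\underbrace{\big\|w_t - w^\dagger\big\|}_{\text{optimization error of gradient descent on } \mathcal E:\ \text{decreasing in } t} .
\]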
As I said at the beginning, and so I don't forget: as you can already see from here, the square loss plays a very crucial role, because otherwise I do not have this expression for the iterates. The idea is that I can write the iterate at the end of one epoch as a perturbed gradient descent
iteration: it's almost a gradient descent with step size gamma, but we have here a perturbation term. And the idea will be to compare this with the analogue in the continuous setting. So here I have this operator, which is just the covariance,
the second moment operator; this one is an element of H; and then here you have these two quantities that are quite complex, as you can see. So that's the point. The same you can do for the iteration
on the expected risk, and you get a very similar expression, with very similar A and b. So what you do, basically, is to write the difference between these two iterates as a sum involving an operator which has norm less than one, so you can forget it, and then you have this sum here,
where you always have, let's say, an empirical mean and its expectation. So you compare the empirical mean with the expectation, and basically you can apply a concentration inequality. The only problem is these two complex terms, because they are not sums of independent variables,
but they are still sums of martingale differences, for which the Pinelis concentration inequality still applies. Then you get a bound of this form, increasing in t and decreasing in n, as I was saying, with high probability. The approximation error instead is standard, well known, and you can prove that the rate
depends on the source condition and on the sum of the step sizes, which are constant. The final result is obtained by simply balancing these two terms, and you obtain the expression that I showed. So, regarding the contributions, I think we add some results
that are the first results trying to explain the generalization properties of some heuristics used in practice for stochastic gradient, more specifically multiple passes and early stopping. And so some future work I already mentioned; some of it was mentioned already by Francis.
So I think that's all, thank you. Okay. This is the paper. Thank you. So, further questions and comments for Silvia. And Silvia can start, if you want. So there was no notion of randomization within a pass?
Okay. So we started from the cyclic one, thinking it was the easiest; I don't know if, for the analysis, it really is the easiest one. But in any case, I think that Lorenzo Rosasco with Junhong Lin has already tried to generalize these results,
both to losses other than the square loss and to other sampling techniques, in particular the stochastic one, where you visit your points multiple times but you randomly select one at each step. What is still missing, and would be very interesting,
I think, is random reshuffling, let's say. Because here you're saying: my data came from some distribution, i.i.d.; then I take whatever order, and I'm just running through it, and you're saying everything is fine.
It works, it works. I mean, it works also in the deterministic case in optimization. But here it's all asymptotic, I guess? No, the key is that you track the full gradient iteration on the true risk up to the point where you start accumulating error, and this is the t star of n. I'm just trying to reconcile this kind of result
with the earlier results we talked about during this workshop, about random permutation versus random sampling versus... The random sampling, I think, comes from the fact that your points are i.i.d., and that's it. So that's like your random permutation? I think so.
And then after this, this is just an optimization procedure to minimize, well, not really, but to minimize the empirical risk. And you can either choose to visit your points randomly, or visit your points cyclically, or choose to visit your points by randomly reshuffling them after each epoch.
From the optimization point of view, this last policy seems empirically to be the best one. There are some very recent results showing that it works also in theory, but I don't know exactly the constants there.
What is known, at least to my knowledge, is that in the deterministic setting the incremental, cyclic choice of the points usually behaves worse than stochastic sampling; but, for instance, in the strongly convex case they are the same. Then there is a recent paper by
Gürbüzbalaban, Parrilo, and Ozdaglar. Yeah, but there the constant is awful. Yeah, yeah, the constant; yes, it's a matter of the constants. But still, it's linear. So, let's say, we don't see this effect here? I don't think so.
What could happen, but I don't think so, is that if your optimization procedure is better, faster, you can probably stop earlier; but I would expect this more from an accelerated method than from the randomly reshuffled one. Yeah.
Sorry. Another question for Silvia. Yeah. Could you show your convergence rate again? Again, yeah.
The best r is infinity. Yeah, yeah, but you cannot choose it; it comes with your problem, right? It's a datum of your problem. So what's the typical value of r?
I don't know, I don't know. If the data is i.i.d., in the traditional framework the convergence rate is one over square root of n; but in your case your best convergence rate is square root of n?
Yeah, be careful, because here we are looking at convergence of the iterates and not of the error. So the square root of n that you're talking about is for the error, okay? In this case, on the error it would be two r divided by two r plus one, so it should be better than one-half, I hope.
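(For the record, a hedged reading of the exponents being discussed, with the denominators inferred from the limiting behaviors mentioned in the talk, so this is a reconstruction rather than the precise statement of the theorems:

\[
\big\|\hat w_{t_*(n)} - w^\dagger\big\|_H \;\lesssim\; n^{-\frac{r-1/2}{2r+1}},
\qquad
\mathcal E\big(\hat w_{t_*(n)}\big) - \inf_{w\in H}\mathcal E(w) \;\lesssim\; n^{-\frac{2r}{2r+1}},
\]

so the H-norm exponent tends to one half and the excess-risk rate tends to one over n as r goes to infinity.)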
You should compare one-half with this. So when r goes to infinity, this goes to one over n. OK. Other questions? Guillaume, so I have a question related to
basically, at the beginning you talked about regularization, and now this approach is not using explicit regularization at all, or maybe not. So I basically have two questions. If I know the value of r, which one is going to be faster,
to use early stopping or, I guess, the one-pass method? If you know r, I would suggest using one pass, because it's one pass.
But the point is that usually you don't know it. And yes, the regularization comes from the early stopping, if you want: it's an idea that is well known in the inverse problems literature, where you in a sense exploit a self-regularizing effect of iterative procedures,
so you can interpret them as regularization procedures, and the early stopping of the iteration gives you a form of implicit regularization. I understand, but could you add some regularization? I mean, adding some regularization would make your problem more strongly convex,
so it seems it would help you to make progress at the beginning; is there a way to add regularization and still do early stopping? No, I think it works, you can add it, and I think this analysis, in a sense, still applies,
but probably can be refined. What I fear is that the lambda will depend on r. Again, so you will have something like this, where you should then cross-validate also on this parameter. If you use a fixed lambda, maybe you can change that.
So if you use a fixed lambda, I don't know if you are able to approximate the true solution. I don't think so. I'm not sure. No, I meant that you could allow lambda to change with t, and not only with n.
Oh, yes, yes, this is also, sorry, yes, of course. So I compared my results with, let's say, the results in the online setting where the horizon is known, because this makes more sense in the sense that we know the horizon. So it would be, but of course, so here, so all these quantities here are constant with respect to t,
but in the true online scenario, you will have something that varies with t, more or less in the same way. So it will not be constant. Yes, you could do it, but I guess that will depend again on R, also in this case. Last question.
And when you don't know r, what kind of model selection procedure would you recommend? Cross-validation, maybe, because of the stopping. So also for the... guarantees on... Yes, yes, you have guarantees; you can achieve the same rates with cross-validation. For the iterates you need a balancing principle,
but since the stopping time is the same, both for the risk and for the iterates, you can play a little bit with this and use cross-validation. Thank you, Silvia, again.