Highly-Smooth Zero-th Order Online Optimization
Formal Metadata
Title: Highly-Smooth Zero-th Order Online Optimization
Part: 5 of 10
License: CC Attribution 3.0 Unported: You may use, adapt, copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, provided the author or rights holder is credited in the manner they specify.
Identifier: 10.5446/20253 (DOI)
Transcript: English (automatically generated)
00:15
So first of all, I want to thank the organizer for this invitation. It's always a pleasure to be here.
00:22
It's quite an impressive place. And I have to say that this is joint work with Francis. You might know him. And as Simon said, it's going to be on zeroth-order stochastic optimization. And the main assumption is that we are going to focus on functions.
00:42
So we want to do optimization of convex functions that are very, very smooth. So you will see later what I mean by that. Well, that's it. OK, so what's our motivation? So as the title suggests, we have lots of keywords in this title. All of them have a different meaning.
01:01
And what we want to do is classical convex optimization. That's the main idea. So we have a mapping f, which is convex, of course. And we want to optimize it. And we're going to assume some regularity on this function. Typically, it is either smooth or non-smooth.
01:23
And I don't know if anybody is that familiar with the vocabulary of convex optimization. So we'll just remind you a bit on all the keywords that we use. For the mapping, we use f. We say that it's non-smooth if it has some kinks on it, such as, for instance, the hinge loss.
01:41
Or it can be smooth. Smooth if, typically, all its second derivatives are bounded. But typically, in machine learning, when you're looking at a function that you want to optimize, if you take the logistic regression or if you want to minimize the square norm, your functions are not only smooth. If you compute the second derivative, they are bounded.
02:01
They are not only smooth, they are really, really smooth. So for instance, if you take the square norm and you take the second derivative, it's bounded. The third derivative is zero. So the fourth and so on, all the higher derivatives are zero as well. And this is the same thing for the logistic regression. So here, we typically are going to try to aim at optimizing functions,
02:26
convex functions, that are smooth. But we don't use the fact that they are really smooth. The question is, can we use this higher degree of smoothness to improve rates of convergence? And that's the main topic of this talk.
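To make the "really, really smooth" examples concrete, take the squared norm (a small worked illustration, not from the slides):

\[
f(\theta)=\|\theta\|^{2},\qquad \nabla f(\theta)=2\theta,\qquad \nabla^{2}f(\theta)=2I,\qquad \nabla^{k}f(\theta)=0\ \text{for all }k\ge 3,
\]

so every derivative beyond the second vanishes; the logistic loss behaves similarly, with all higher derivatives bounded.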
02:43
So as I said, we're going to look at stochastic optimization. That means that when you're making queries, you get noisy feedback. So for instance, let's assume that the mapping f you want to optimize is the Euclidean distance, the expected distance between your estimator of theta and the true parameter theta.
03:01
Actually, when you query f, you don't have access to this value; you only get an estimate, some random, noisy feedback on this mapping. So you just observe or compute the norm between theta hat and theta, and you don't get to see this expectation.
03:22
So that's where stochasticity appears. We have noise. And so what's our objective? So as I said, we want to optimize, so we want to do stochastic optimization of a convex function. So this is an optimization algorithm, how does this work? So you have this unknown mapping f that you want to minimize.
03:42
So you have a constraint set X, a subset of R^d, that you know. And let's see how it works. So when you're doing optimization, you're going to query your first point. So you say, okay, what's the value here, so you're going to query x1. And typically, you get some feedback. The feedback you can get can be different depending on what's your computational power.
04:04
For instance, you can get the Hessian of f and then you can do Newton's method if you can do it. Or maybe if you have less computational power, you will only have access to a gradient of f, you can compute the gradient of your function. Or if you have even less computational power,
04:22
maybe you will just get the value of f. And in all these settings, every time you get the value of something, it's noisy. So either you get the value of f plus some noise, or the gradient of f plus some noise, or the Hessian plus some noise. And here, the title suggests that we're going to look at zeroth-order methods.
04:41
Zeroth-order means that you don't have access to the gradient nor to the Hessian, because it's too difficult to compute. You cannot do it. You will just have access to the value of your function. So let's go back to the typical optimization algorithm. So you make the first query, x1. Then you get a feedback, which is the value of f at x1 plus some noise.
05:01
Based on this, what you're going to do is you're going to output what you think is the minimum of your mapping f. And we'll call this x star 2. x star 2 is what you think is the minimum of f. If you are lucky and you can query a second point, then you will make a second query, x2.
05:22
You will get your feedback. So in this example, f of x2 plus c2. You're going to output x star 3, which is what you think is the minimum of f, and so on and so forth. And the idea of optimization is to minimize your optimization error, which is the difference between what you think is the minimum of f after t steps.
05:43
So it's f of x star t plus 1 minus the minimum of f. So that's what we want to do. We want to aim at building a good optimization algorithm that solves things like this. So why I am mentioning all this almost common knowledge on optimization
06:03
is because actually the first word of the title, maybe not the first one, I don't remember which one actually. It's the last but one word of the title says online. So what's online optimization? So online optimization is exactly the same thing as the classical stochastic convex optimization,
06:22
except that the mapping f that you want to optimize, so it's still a convex mapping, but it can change over time. It's not always going to be the same one, f, but when you make the first query, it's going to be f1, then it's going to be f2 and f3 and so on. And typically, so we have, as I say, we have a sequence of mappings. We don't have just one mapping, f.
06:41
We have a sequence of mappings, f1, f2, f3, and so on and so forth. And when you make a query x1, you get a feedback f1 of x1. Or if you have enough computational power, you can get the gradient of f1 at x1 or the Hessian of f1 at x1. But here in this setting, we're going to assume that we can only compute the value of
07:01
f, we are doing zeroth order. That means that you only get a feedback f1 of x1. So the thing is, it is the same. You query x1, you get f1 of x1. You output x star 2, which is what you think is the minimum of f2. And then you query x2, you get the feedback f2 of x2. Then you output x star 3 and so on and so forth.
07:22
And while in classic convex optimization, we're looking at the optimization error, which is f of your guess of the minimum of your function minus f star, here we are going to look, in this online scenario, at a criterion which is weaker than this, which is called the regret.
07:41
So Emily spoke about this in the last talk. The regret is that thing. And typically you compare, so you have the sequence of mappings f1, f2 and so on and so forth up to fT. And you compare your cumulative loss to the minimum over all x of the cumulative loss.
08:01
So you take the cumulative loss, the average loss, and you look at the point that minimizes the average loss. And you compare this quantity to the cumulative loss that you incur by outputting x star t at stage t. So that's the regret, the difference between the cumulative loss and the best loss in hindsight,
08:21
which means if you knew in advance the average function f, you will predict x star at each stage. So here's just a disclaimer. This is not exactly a bandit, for those that know what is bandit convex optimization. This is not bandit because in bandit, we typically assume that x star k is equal to xk. And here, I just assume that x star k has to be in the constraint set.
08:44
I can query my function anywhere, wherever I want. It does not have to be in the constraint set x. So here, the slide was just to define what is online optimization. But to be precise, from now on, I will not speak about online optimization at all.
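Written out (a sketch in the talk's notation, where x star t is the point output before stage t and X is the constraint set), the regret just described is

\[
R_T \;=\; \sum_{t=1}^{T} f_t(x^{\star}_t) \;-\; \min_{x\in X}\sum_{t=1}^{T} f_t(x),
\]

the cumulative loss of the algorithm minus the best cumulative loss in hindsight of a single fixed point.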
09:02
It was just to mention that I can do it. And actually, you will see the results. All the results that we have in classical convex optimization hold for online optimization for free. And typically, I will just show one slide of proof, and you will see why everything that I said,
09:23
all the results in convex optimization hold for online optimization. And actually, to be even more precise, you will see why online optimization is actually simpler than convex optimization in some sense. Okay, so in the title, I had this long title.
09:43
Here, I just spoke about convex optimization, online optimization, stochastic optimization. And in the title, I made precise assumptions. The first one, so in the title, I had the assumption of smoothness. And we are also going to make another assumption
10:00
on the convexity of our mapping. So this is rather classic in convex optimization. So we have two types of assumptions that help us to solve a convex optimization problem. The first type of assumption is that f is mu strongly convex, and the second is that f is 2-smooth.
10:21
As you can see, this is more or less the same assumption, except that one sign here is bigger or equal to, or less or equal to. Okay, so that's kind of the dual condition. So the first one is what we call being mu strongly convex. And typically, it says in one dimension,
10:41
your convex mapping that you want to optimize is going to be strongly convex, mu strongly convex, if the second derivative is bigger than mu. So you know that any convex mapping is by definition zero strongly convex. But here, when you are mu strongly convex, so when you are more than strongly convex, it helps you for the optimization because you can see that your mapping f
11:02
is lower bounded by a quadratic term. The second type of assumption that we're going to make, or not, depending on the setting, is the smoothness. And typically, in the literature, what is common is to look at 2-smoothness. It means that your gradient is Lipschitz,
11:21
or stated otherwise, that you can upper bound your mapping f by another quadratic term. So if you are, at the same time, mu strongly convex and 2-smooth, your mapping f is lower bounded by a quadratic term and upper bounded by a quadratic term. And both of these assumptions help you, and you can get faster rates of convergence.
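In formulas, the two conditions just described read roughly as follows (writing M_2^2 for the smoothness constant, as on the slide; constants may be normalized differently):

\[
f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y-x\rangle + \tfrac{\mu}{2}\,\|y-x\|^{2} \qquad (\mu\text{-strong convexity}),
\]
\[
f(y) \;\le\; f(x) + \langle \nabla f(x),\, y-x\rangle + \tfrac{M_2^{2}}{2}\,\|y-x\|^{2} \qquad (2\text{-smoothness}),
\]

so f is squeezed between two quadratics that touch it at x.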
11:43
And now the question is, can we use this kind of assumption in the stochastic online convex optimization setting with zeroth-order feedback to improve rates of convergence? Yes? Why do you use M2 squared, is that just like L or something? So why do we have a square here? Because on the next slide, instead of having a 2,
12:01
I will have something different, like a 3, and I will put a 3 here, just to scale, so that the problem scales. But would you need d? It's, yeah, as Francis is very careful of, about the homogeneity of your, of m. Okay, because if, yeah, that's it. But why not mu squared?
12:20
Because, why about mu squared? Because the units are different. Because here, no, no, because here, okay, I have no idea. Okay, so I mean, the true reason is that I'm not going to look at very strongly convex functions, where you can lower bound your mapping
12:42
f by a polynomial of degree 3. But if we were doing this, like lower bounding f by something which is of degree 3, you should put a 2 here. But here, as we are not doing it, we're not putting a 2. Okay? But that's a very good question. We have to find another notation
13:01
and it's complicated. Okay, so if you want, just take your mu to be sigma squared. That's another crappy response. Okay? So, just a picture if, for people that are not used
13:21
to convex optimization. So, let's say you want to optimize the mapping in black f and being smooth means that you have a quadratic form. So, this is a quadratic form, by the way, above your f. And being strongly convex means that you have another quadratic form
13:41
in red below it. So, here I want to, if I want to do a disclaimer, I can say that this is because of the, you know, the parallax. But if you are where I am, it's really a quadratic form. And so, just,
14:01
So, I have to speed up. So, just a few words on this for those that are not that familiar with convex optimization. See, if you are at x, if you are looking at x and you know that your mapping is 2-smooth, then you can make a big jump because you know that you can still query that point here.
14:20
And if you query y, so you are at x and you query y, you know that your error is going to decrease by that quantity. So, you can make big jumps. And if you are mu strongly convex, it tells you that your mapping, so if you are lower bounded by a quadratic term, it tells you that. So, if you think the minimum is around x, it's not going to be that far away because you know that the minimum is going to be
14:40
on that part of the plane. So, you can chop off a large part of the plane where you know that the minimum is not going to be. So, being strongly convex helps you, as well as being 2-smooth. No need to spit, I was making a joke about 420, so it's called how well to smoke pot. Okay, sorry.
15:02
Given your quadratic functions, I was worried. Anyway, so just to answer your question, Simon, why do you have a 2? Because the classical notion of smoothness is being 2-smooth, and being 2-smooth
15:20
means that your second derivatives are bounded by M2 squared. And it means that the difference between f and your Taylor expansion of degree 1 is of order 2 in y minus x. So, here we're just going to say that f is beta smooth, with beta bigger than 2,
15:41
if we have the same thing, but instead of looking at the Taylor expansion of f of degree 1, we look at the Taylor expansion of f of degree beta minus 1. And being beta smooth means that the difference between f and its Taylor expansion is smaller than y minus x to the beta. And if you don't know what is Taylor expansion in higher dimension,
16:01
as I did, this is a formula, but let's get rid of it. So just remember that being beta smooth means that you are close to your Taylor expansion of order beta. And this is typically an assumption that, again, will hold for the function we want to optimize in machine learning. Think of a square norm
16:23
or a logit. Anyway, so the aim is to use the fact that we know that f is really, really smooth, beta smooth, not only 2-smooth, to increase the speed of convergence. Just a few words on this assumption, being beta smooth.
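As a formula, the definition being referred to is roughly the following (eliding the exact normalization used in the paper): f is beta-smooth with constant M_beta if, for all x and y,

\[
\Big|\, f(y) \;-\; \sum_{k=0}^{\beta-1}\frac{1}{k!}\, D^{k}f(x)\,[\,y-x,\dots,y-x\,] \,\Big| \;\le\; M_\beta^{\beta}\, \|y-x\|^{\beta},
\]

that is, f stays within M_beta^beta times the beta-th power of the distance of its Taylor expansion of degree beta minus 1 around x.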
16:42
So it's kind of, if you look at this definition, if you're zero smooth, it means that you are bounded. So if you're convex and bounded, of course, it has to be on a compact set, otherwise you are constant. So if you're zero smooth, it means that you are bounded by M0. If you are one smooth, it means that you are Lipschitz, M1 Lipschitz. Two smooth is the same definition,
17:01
of course. And if you have a mapping which is beta one and beta two smooth, with beta one smaller than beta two, then you are smooth for all value of beta between beta one and beta two. And for the function I mentioned before, for the logistic regression, or all the
17:20
expected quadratic norm, then you can compute all those explicitly, you can compute explicitly all those quantities, M beta to the beta. For logistic regression, M beta increases only linearly in beta. So it's pretty small. Okay, so it was just to mention
17:41
this assumption. And now, what are we going to do? And what are our objectives? So before describing our objectives, let's do a bit of a literature review of the classical optimization methods. So the first one, if we are doing
18:01
convex optimization without noise and with access to the gradients, so it's not at all what we are looking at here, but just to get some intuition. So if you have access to the gradients, so you're looking at first-order methods, and there is no noise, we can find an optimal, and I'll put quotes on this, optimal method to minimize a convex function,
18:21
which is the ellipsoid method. So for those that don't know it, let's say that at some stage you know that the minimum of your mapping f is in this ellipsoid in black; then you query the center of your ellipsoid, you get the gradient, and then you know that the minimum of f is not going to be, since f is convex, on the left part
18:40
of this ellipsoid. So you can just chop off this part of the ellipsoid, and you will know that the optimum is actually in the right part of the ellipsoid here. And the ellipsoid method is basically saying, okay, instead of looking at this weird shape here, I'm going to expand it to find another ellipsoid that contains it,
19:01
and I know that at the next stage my minimum is going to be in the green ellipsoid here. And x t plus one will query the center of that ellipsoid, chop off a half, and so on and so forth. And if you look at this algorithm, along the sequence of ellipsoids you're going to build like this, the volume decreases by a constant factor,
19:21
and so you will converge at a linear rate. But this has to be without noise. If you have noise, it's not working, so you have to query several times the same point, and it will never work with, or almost never work, I should never say never, it won't work with online optimization. So we will not look
19:42
at that kind of techniques. Instead, we're going to look at gradient method, because gradient method works pretty well with online optimization. Gradient method, I don't know if anybody knows, but the algorithm is that one, so you have a constraint set X, and the gradient method, if you don't have any noise
20:01
and you have access to the gradient, is exactly that formula: x t plus one is x t minus, and you're doing a gradient step, so you're querying, as x t plus one, x t minus eta times the gradient of f at x t. And if you go outside your constraint set X, then you just project back on X, if you can do it.
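A minimal sketch of the projected gradient step just described; the toy objective and box constraint here are only for illustration, not from the talk:

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, eta, n_steps):
    """Plain projected gradient descent (exact gradients, no noise).

    grad_f  : callable returning the gradient of f at a point.
    project : projection onto the constraint set X (here, clipping to a box).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = project(x - eta * grad_f(x))  # gradient step, then project back onto X
    return x

# Toy example: minimize ||x - c||^2 over the box [-1, 1]^2.
c = np.array([2.0, -0.5])
x_hat = projected_gradient_descent(
    grad_f=lambda x: 2 * (x - c),             # gradient of ||x - c||^2
    project=lambda x: np.clip(x, -1.0, 1.0),  # projection onto the box
    x0=np.zeros(2), eta=0.1, n_steps=200,
)
```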
20:21
And this is, so it's a very powerful method, and you can have explicit rates of convergence. So if you don't have any assumption on your mapping f except being smooth, so again this is if you don't have noise, and if you have access to the gradient. So if your mapping is smooth,
20:41
no more assumptions than being smooth, so that means you are Lipschitz, the rate of convergence is in one over square root of t. As I told you before, if you add assumptions like smoothness and strong convexity, the rates of convergence are going to be faster and faster because it helps you. So 2-smooth, you will go from one over square root of t to one over t, which is pretty,
21:00
so it's faster, and you can even accelerate this algorithm by using another algorithm, and get rates of convergence in one over t squared. So if, instead of adding the assumption that you are 2-smooth, you add the assumption that you are strongly convex, then you get rates of convergence in one over t,
21:21
well actually one over mu t. And if you add both assumptions, so if you have both smoothness and strong convexity, then the rate of convergence is linear. And this tells you that when you add smoothness and strong convexity to your problem, you can converge faster with these additional assumptions.
21:41
And so our idea is to do the same thing with stochastic optimization with zeroth-order feedback. So zeroth order means, again, that you don't have access to the gradient, you just have access to the value of f at a point. So now there is this, if you don't have noise,
22:00
there is this trivial argument, which is unbeatable, that says that if you're in one dimension, and you're trying to estimate the gradient of f, you just have to compute f of x plus delta minus f of x minus delta, over two delta. And this, if delta is equal to zero, or very, very small, this is exactly the gradient of f. This is totally true.
22:22
But, so you could say that you could get the gradient of f for free just by using 2d queries, or d queries, depending on your algorithm. And so the rates of convergence would be the same, except that t is transformed into t divided by d. So this is correct, when you don't have noise. But as soon as you have noise,
22:41
everything about, I mean, this is no longer true when you have noise, because when you have noise, let's say you had an epsilon here, and we divide by delta, and so this blows up. So this idea is certainly true when you don't have noise, but when you have noise, you certainly don't want to use this, with delta equals to zero. So there is,
23:04
in zeroth-order methods, there is a way to solve this kind of problem. So here you're trying to mimic first-order methods using zeroth-order feedback, but there is a way to solve convex optimization using really true zeroth-order methods.
23:21
And in one dimension, it's quite simple actually. So let's say you have, you query those three points here, you query the left, middle, and the right. And if you see that, without noise, if you see that the value of your function is like this, so the middle one is the smallest one, you know that the minimum of your function is going to be in that part of the plane. So you can remove
23:41
the right part and the left part. Here, if the three values at your query points are increasing, you know that the minimum of f is going to be on that left part of the plane here. So typically here, when you're doing three queries at a time, you are able to split the remaining state space in two.
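A minimal noiseless sketch of this three-query picture in one dimension (my illustration, in the spirit of the slide, not the method used later in the talk):

```python
def three_point_search(f, lo, hi, n_rounds):
    """Zeroth-order minimization of a convex f on [lo, hi] without noise:
    each round queries a left, middle and right point and keeps the half
    of the interval that must contain the minimum."""
    for _ in range(n_rounds):
        mid = (lo + hi) / 2.0
        left = (lo + mid) / 2.0
        right = (mid + hi) / 2.0
        f_left, f_mid, f_right = f(left), f(mid), f(right)
        if f_left < f_mid:        # values already increasing: minimum is left of mid
            hi = mid
        elif f_right < f_mid:     # values still decreasing: minimum is right of mid
            lo = mid
        else:                     # middle value smallest: minimum is between left and right
            lo, hi = left, right
        # each round the interval is halved, hence a linear rate
    return (lo + hi) / 2.0

x_min = three_point_search(lambda z: (z - 0.3) ** 2, 0.0, 1.0, 40)  # about 0.3
```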
24:02
So this will give linear rates of convergence. And this is true in one dimension. You can say, okay, but you can do that because you can do binary search in one dimension, so it's easy. But actually, you can do that in a higher dimension and you will get linear rates of convergence. But the issue is that the rates of convergence are, the dependency in t
24:20
is t divided by d to the 7, which is pretty slow when d is big. So we're not going to use that kind of method, but we could, but we're not going to do it. Here, the take-home message of these slides is actually what I wrote here,
24:41
is that when you're going from first-order to zeroth-order methods, you just multiply your rates of convergence by d, typically, because instead of needing one query to get a gradient, you would need d queries to get an estimate of the gradient. So going from first to zeroth order, you multiply your rates by d. So now, if you add noise into the picture,
25:02
and if you look at all the cutting-plane algorithms, such as the ellipsoid method and so on, you have to query several times the same point to reduce the variance. It's not really interesting, and it's not really working, actually. So instead, we're going to look at the stochastic gradient. So let's say that instead of observing your gradient, the gradient of f, you have a noisy
25:20
version of this gradient. And stochastic gradient descent is exactly the same as gradient descent, except that you don't observe this gradient. You're just plugging in the noisy version of your gradient. And there is a whole literature on this problem, that says that if you have an unbiased estimate of your gradient, that means that if the
25:41
expectation of xi is 0, and its variance is going to be like d sigma squared, the d is important here, d is the dimension, then you can get rates of convergence of the order of square root of d over t if we have a non-strongly convex function. And if we add the strongly convex assumption, then the rates of convergence improve from square root of d over t to d over t.
26:02
And if you remember, I know it was quite a long time ago, if you remember, the rates of convergence without noise, they were the same rates, except that instead of having a d, we had a 1. So it's square root of 1 over t and 1 over mu t. So typically when you go from noiseless to noisy optimization,
26:21
what happens is, again, you multiply rates of convergence by d. So when you go from first order to zeroth order, you multiply your rates by d. When you go from noiseless to noisy, you multiply rates of convergence by d. So now the question is what happens when you go from first order
26:41
noiseless to zeroth order noisy? Well, we expect that we multiply the rates of convergence by d and d, so by d squared. And this is exactly what we are aiming at. So this slide is just a summary of what we know. So that's the first three columns.
27:00
Those are the rates of convergence that we are aware of. That's what we aim at. So if we are looking at stochastic convex optimization, zeroth order and noisy, of a convex function, we're looking at, instead of having one over square root of t, adding noise will multiply by d, being zeroth order will multiply by d,
27:21
we are looking at d squared over square root of t. The same thing here. So that's what we would like to obtain. And that's what we get, which is not the same, I agree. But still, it's pretty much the same one, because as you can see, when beta is really, really big, take beta to be infinity as in the square norm,
27:40
or the logit. We see that as beta goes to infinity, so beta minus one over beta, and beta over beta plus two, go to one. So what we get is the right rates of convergence, we match the objectives. It doesn't match when beta is small, so when we just assume that the mapping is 2-smooth but not very smooth.
28:01
It doesn't match for sure, but at least when the mapping is really, really smooth, we recover the right rates of convergence that we are aiming at. And just to finish the literature review, what do we know about this problem actually? So, on zeroth-order noisy optimization.
28:21
So we know, we do know, that this square root of t is the optimal speed. So we do know that we can find an algorithm such that we get a rate of convergence in poly(d) divided by square root of t. The issue is that we don't know the dependency in d. And here, we really aim at having,
28:40
for instance, in the convex case, instead of having a poly(d), we really aim at having only d. So if we assume that the mapping is strongly convex and smooth, then we do get d squared over t. That's a paper in 13, 14, I don't remember. If we have only one of the two assumptions, strong convexity or smoothness,
29:01
then we get a rate of convergence in t to the minus one third. If you are only convex, t to the minus one fourth. And actually, a paper of this year says that if you are only convex, you do get one over square root of t, but multiplied by log t to the d. And this is, you can convince yourself that log,
29:20
actually, when you look at that term, the dominant term is not square root of t, it's log t to the d. Log is very small, but when you raise it to the d, it's very big. And I think when d is 10, no matter what t is, or for d, let's say 30, to be more careful. So when d is 30, log t to the d is bigger than square root of t.
29:41
No matter t. Why? Why? Because, I mean, just do the computation. And, you have, it's true for all human t. Okay? So if you take t to be infinity, then it's not true. But if you take t, which is, I think,
30:00
I should have done it, I think it's, if you take t to be 10 to the one hundred, then it's still log t that dominates. Okay, so it is for? For all t's that you can imagine. The number of particles in the universe times the number of seconds since the Big Bang or, I don't know, I don't care.
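A quick back-of-the-envelope check of this claim (a sketch; the crossover exponent is only approximate):

```python
import math

# For d = 30, compare log(t)**d with sqrt(t) for huge t = 10**exponent.
# Compare logarithms of both sides to avoid overflow.
d = 30
for exponent in (20, 80, 120, 150, 160):
    log_t = exponent * math.log(10)
    lhs = d * math.log(log_t)   # log of log(t)**d
    rhs = 0.5 * log_t           # log of sqrt(t)
    print(exponent, lhs > rhs)
# Prints True for every exponent up to roughly 150, i.e. log(t)**d dominates
# sqrt(t) for any t you could plausibly write down, and only fails beyond ~10**150.
```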
30:22
You can take... Yeah, but I don't care, because it's in the log. Okay? So you don't have to be homogeneous. Anyway, so, no matter how big t is, no matter the size of t that you can imagine, it is the log t to the d that dominates, unfortunately. But, still it's a, it's a major
30:41
major breakthrough. And, so when you have strong convexity and a bit of smoothness, there's this paper of Polyak and Tsybakov in 99 that gets more or less the same rates of convergence, but asymptotically and without the dependency in d. So again, we are aiming at having exactly
31:00
something that depends on d as d squared over t. Okay, so that's our objective, to get those rates of convergence. We won't be able to get them, but we can get pretty close to it. How much time do I have? Do you want to know? Like half an hour? Ten minutes? Five minutes? I have no idea.
31:21
Okay. So, we're going to do that, and there are two tricks that we're going to combine. The first one is well known in that setting; it's just a way to estimate the gradients. And the other one is a way to smooth even more the mappings that are already very smooth.
31:43
So, here we want, we would like to use a gradient method for our problem, but the issue is that we don't have access to the gradient, we just have access to the noisy value of f of x. So, we have access to f of x plus some noise. But, we have this, I mean, this like
32:00
very natural estimation of the gradients that I spoke about before. We know that in one dimension, f of x plus delta minus f of x minus delta over two delta is more or less the gradient of f at x, more or less. If delta is very small, this is true. And, when delta is a bit small, this is almost true.
32:21
But, the thing which is quite surprising is that this difference here is exactly the gradient of a function f delta, which is almost f. So, that thing is almost the gradient of f, but it is exactly the gradient of a mapping which is almost f. And, this is the gradient of f delta, where f delta is that function.
32:41
f prime of delta, where f prime is that function. And, if you look at it, it's just the expectation of f of x plus delta v over the ball, the unit ball. And, so, if we were minimizing f delta, we would have, we can have a non-biased estimator of f,
33:01
of the gradient of f delta, simply by drawing epsilon plus or minus one with one half, and plugging gz to be f of x plus epsilon delta times epsilon divided by delta, because that guy here, when it takes the expectation of epsilon of that thing, is exactly that thing. So, the expectation
33:20
of g is exactly the gradient of f delta. So, this is unbiased for f delta. So, this is unbiased. And, if you look at the variance of your estimator g, it's of the order of one over delta squared. So now, if you were to do a stochastic gradient descent with respect to f delta, since it is convex, then the rates of convergence are going to be
33:40
one over square root of t times the standard deviation of your estimator. So, it's one over delta square root of t. But here, you optimize the f delta instead of optimizing f. Then, since they are pretty close, they are delta close, actually, your error is going to be of the order of delta plus one over delta square root of t. And, if you optimize in delta,
34:01
you will get one over t to the one-fourth. And, this is exactly the idea that I mentioned here, of Flaxman and his co-authors, to do that. Which is the idea of Nemirovski and Yudin. And, this is the idea of Nemirovski and Yudin. Exactly. But, I think, oh, I remember that this is a... So, this idea here
34:21
is actually due to Nemirovski and Yudin, as Francis says. So, if we go from dimension one to dimension d, actually, we have the same idea, that says that if you look at the expectation of that guy here, on the unit sphere, so, in one dimension, we are sampling epsilon over plus one, minus one. So,
34:40
it's something of the unit sphere on dimension one. So, if we sample u on the unit sphere on dimension d, and we take the same guy, then the expectation of that thing is exactly the gradient of f delta. So, this is almost the gradient of f, but this is exactly the gradient of a mapping which is almost f. And so, this is a gradient of f delta, where f delta is the expectation
35:01
over the unit ball of almost f. And, if you do the same computation, you see that the variance of that estimator is bounded by d squared over delta squared. And so, if we do exactly the same trick, the same computation as here, exactly the same one, you get a rate of convergence of d squared over t, to the power one fourth.
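A minimal sketch of this spherical one-point estimator (a Nemirovski-Yudin / Flaxman style construction, as described in the talk; the toy usage below is only illustrative):

```python
import numpy as np

def sphere_gradient_estimate(f_noisy, x, delta, rng):
    """One noisy query per call; in expectation this equals the gradient of
    the smoothed surrogate f_delta(x) = E_{v in unit ball} f(x + delta * v)."""
    d = x.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)          # uniform direction on the unit sphere
    y = f_noisy(x + delta * u)      # single zeroth-order query, noise included
    return (d / delta) * y * u      # variance scales like (d / delta) ** 2

rng = np.random.default_rng(0)
g = sphere_gradient_estimate(
    lambda z: float(np.dot(z, z)) + rng.normal(scale=0.1),  # noisy ||z||^2 oracle
    x=np.ones(5), delta=0.5, rng=rng,
)
```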
35:22
So, as Francis says, this is a result, an idea, due to Nemirovski and Yudin, and then Flaxman et al., and then used again by Hazan, and used by lots and lots of people. So, we're not claiming that that's our idea. Not at all. We're just using it. Our idea is to combine it with another trick.
35:42
And, the trick is to use a kernel to regularize our mapping. So, why are we going to do that? Because, remember, that our mapping were very, very smooth. So, that means that mapping f, we know, remember that f is very close to its Taylor expansion. So, f of x plus a small r is very close
36:01
to its Taylor expansion. So, it's very close to a polynomial in r. And, it's close up to r to the beta. And now, we're going to introduce a kernel. So, a function k. Such that, when you integrate k against r, you get one, and you integrate k against all the other polynomials, you get zero.
36:21
And, you can have explicit forms for this kind of kernel, and you can compute them. So, you have an explicit form. And, why should we do that? Because, if you use this kernel, and you integrate f against r times k, you will see that that guy here is pretty close to f.
36:42
Okay, so we have a pretty good estimate of f, which is up to delta to the beta. Instead of having an estimate, as before, when we were using this type of techniques, we have an estimate of the order of delta. Here, we have an estimate of the order of delta
37:00
to the beta of f. So, we can use the fact that f is really, really smooth to smooth our estimates, again, by multiplying by k, and get a very, very precise approximation of f up to the order of delta to the beta. So, if beta is plus infinity, this is actually pretty small.
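In symbols, the conditions on the kernel just described are roughly the following (a hedged reconstruction; the exact normalization in the paper may differ). With r drawn uniformly on [-1, 1],

\[
\mathbb{E}\big[\,r\,K(r)\,\big] = 1, \qquad \mathbb{E}\big[\,r^{j}\,K(r)\,\big] = 0 \quad \text{for the other small powers } j \le \beta,\ j \neq 1,
\]

so that, once f(x + delta r v) is replaced by its Taylor expansion in r, the kernel weighting keeps exactly the term you want and kills the other polynomial terms, leaving an error of order delta to the beta.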
37:23
So, this is the idea of using this kernel. Now, if we combine both tricks, we're going to use an approximation of the gradient of the function.
37:41
Okay, remember, the trick of Nemirovski said, okay, we can build an unbiased estimate of the gradient of a function which is close to f, and here, we're going to do the same thing, except that the function which is close to f is going to be even closer to f than with the standard estimate.
38:00
And using the estimate is pretty simple. You're doing the same thing. So, you sample v on the unit ball, and you compute f of x plus r delta v, and then you multiply it by your kernel, K of r. So, you integrate this against your kernel. And if you look at that guy here, okay, that's a function, and if you just compute the gradient
38:21
of that function, the gradient of f r delta is exactly given by that formula. And so, this gives you an unbiased estimator of the gradient of f r delta. And this unbiased estimator, so, those mappings are pretty interesting because, as I said before, f r delta is not only close to f,
38:41
delta close, but it's delta to the beta. So, here we have a mapping which is really, really close to f. And the gradients are also very, very close. The gradient of f r delta is very, very close to the gradient of f up to the delta to the beta minus one. And that's what the kernel allows us to do. And if you
39:00
compute the kernel using those expressions, you can compute all those terms here, numerically. And it's only something like beta square or beta maybe to the three. Okay, so here's the only thing which is a bit difficult, the main difficulty of
39:20
these techniques is that the mapping here that we've defined, since we're using a kernel, is not necessarily convex. So, this is kind of a bummer, but it's still mu over two strongly convex, at least, if f is mu strongly convex. And it's still convex if beta is equal to two. But still
39:40
we can still use it pretty efficiently to improve the rates of convergence. We will just lose a bit of speed, but still we get asymptotically optimal results. Okay, so, yeah. I'm confused that you say it's strongly convex
40:01
if f is mu strongly convex. So it's mu over two strongly convex if f is mu strongly convex. So it implies that it's convex. Yeah, so if f is mu strongly convex, then f r delta is mu over two strongly convex, so it is mu, it is
40:26
a major issue actually. Just lose a bit in the speed of convergence, but that's okay. Instead of having a beta, I mean, okay, that's okay. So here,
40:40
yeah, this is, so that's not the main, the setting point of the, yeah. This is just a remark, and this is not very helpful
41:00
for exactly. So let's use both tricks. And what are we going to do? Then we're just going to do stochastic gradient descent with respect to that estimator of our gradient.
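A minimal sketch of the resulting gradient estimate, in the same spirit as the spherical estimator sketched earlier, now with the kernel weighting and a random query radius (the exact sampling scheme and constants in the paper may differ; K is left as an abstract callable, an assumption of this sketch):

```python
import numpy as np

def kernel_gradient_estimate(f_noisy, x, delta, K, rng):
    """One noisy query per step; in expectation this is, up to the caveats
    in the talk, the gradient of the kernel-smoothed surrogate of f."""
    d = x.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)              # random direction on the unit sphere
    r = rng.uniform(-1.0, 1.0)          # random query radius in [-1, 1]
    y = f_noisy(x + delta * r * u)      # single zeroth-order query
    return (d / delta) * y * K(r) * u   # kernel-weighted one-point estimate
```

The constrained algorithm then plugs this estimate into the projected stochastic gradient update sketched earlier, roughly x_t = Proj_X(x_{t-1} - gamma_t * g_t).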
41:21
And so we have two main algorithms, one for when the constraint set X is compact; then we can query at each stage, we only make one query per stage, and we're doing this gradient descent. So that's x t is x t minus one minus gradient
41:41
step size times the estimate of the gradient of the smoothed approximation of f. And the other algorithm we're going to use is when X is unconstrained, let's say it's the whole set, X is R d; then at each stage
42:01
we're going to make two queries, f of x1t and f of x2t, with independent noise. This is crucial, because if the noise were not independent, say the same noise at all stages, then the noise would cancel out; but here we say that each time we make a query, we get independent noise. So the noise is still here.
42:20
Just to be honest on that point, if you want to do online optimization, it means that we can query twice the same function f t to use that algorithm. So it can be discussed whether it's doable or not. If you assume that it's not doable, then use this algorithm.
42:41
The first one does not use the fact that you can query twice the same function. It queries each function only once. So you cannot do unconstrained if you cannot query more than once? No. The issue is that when you just query once and you're looking at unconstrained optimization in the online case,
43:00
then you have an intercept that kills you, typically. You can choose the intercept to be as big as you want or as small as you want, and this kills you. When you can make two queries, then you can kill the intercept, and then you can normalize the problem, actually. It's not about the noise. It's really about the intercept.
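For the unconstrained case, a matching sketch of the two-query variant (guessing the symmetric plus/minus delta r u form from the description; the two evaluations carry independent noise, and the details in the paper may differ):

```python
import numpy as np

def two_point_gradient_estimate(f_noisy, x, delta, K, rng):
    """Two noisy queries per step; subtracting them removes the constant term
    (the "intercept") of f, which is what breaks the one-query scheme on an
    unbounded domain.  Sketch only."""
    d = x.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    r = rng.uniform(-1.0, 1.0)
    y_plus = f_noisy(x + delta * r * u)    # first query, its own noise
    y_minus = f_noisy(x - delta * r * u)   # second query, independent noise
    return (d / (2.0 * delta)) * (y_plus - y_minus) * K(r) * u
```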
43:26
So that's the main algorithm. And then the choice of all the parameters. So here the parameter is going to be gamma of t, delta of t, r of t. All those choices of parameters are going to depend on the assumption we
43:40
make on the function. If you assume the functions are mu strongly convex or not, whether we're looking at constraint optimization or not, the choice of parameters will change. So I'm not going to detail all those choices of parameters. For instance, if you're looking at mu strongly convex and constraint optimization, we have explicit choice of parameters.
44:11
Delta t is the size of the exploration, because this is exploration. So delta t can be chosen of that order.
44:20
And this is not that big. And at the output of the algorithm, we just output the average of the query points that we are averaging. And this algorithm gives an error which scales as we wanted in
44:41
So we do get optimization error terms that scale as d squared over mu t when beta is plus infinity. So what's delta when beta equals infinity? So when beta equals infinity, I think if I remember correctly, it's like
45:00
a constant. Because here we have beta, we have one over beta, so it's beta over e. So we have beta over e. So delta has to be pretty big.
45:24
I think it's the order of the constant, though. So yeah, typically it's a constant. But beta being really, really big is okay because it means that you are really, really
45:52
really
46:05
smooth. It's almost linear. More or less, the idea is that you can query far away from your point to get a precise approximation at your point. For instance, another way to see it is, if you're looking at the quadratic form, if you want to have
46:22
a precise approximation of your gradient of the quadratic form, you can query points which are at plus infinity and minus infinity, and this will give you an estimate of your gradient. Okay. Well, that's the main idea.
46:41
Now let's look at the proofs. So actually, this slide, so you see, the next slide is like the easiest slide you will ever see in your life. But it's just to make a point at the end of the talk. So this is the whole proof of the preceding
47:00
algorithm. So just see that it's only six main arguments, and actually, when you are familiar with convex optimization, it's pretty natural. So when you look at it, I'm just going to actually emphasize the fifth point. So you're just looking at how x t moves away from
47:20
x. You use your algorithm to show that this is of the standard form; you just expand it and this is the standard form. Here you have an error term that you bound by using expectations, and you know that that quantity here is small. Here, this quantity in orange is, in expectation,
47:40
more or less the gradient of f delta r, and so you can plug your gradient of f delta r here instead of this, and, using this, you get that type of inequality. Now look at the first two terms. When you have the first two terms here, you see that you
48:00
can divide by gamma t, so that you get at the end the error made by f delta. So you rearrange this term, and you see that everything, so here you have x t minus x, x t minus 1 minus x, here you have x t minus 1 minus x, you're doing all your chemistry, and it's standard algebra actually
48:20
and it's not that difficult, and you get that form. And here the idea is just, if you pick gamma t to be 1 over mu t, this term here in orange is equal to that one here in orange, and so all those terms are going to cancel out, and you will be left with the sum of all those guys. And this
48:40
is exactly what we do here so you have a sum of the remaining term and that's it. So now if you do not follow at all this is okay. I don't expect anyone to follow the last five sentence I made but it's just to say that I'm not
49:00
cheating you and we arrived here at the point of the proof we are here. Okay. And when you are in this stage five of the proof of convex optimization typically what we say is that okay so here I have so that quantity is more or less a constant and what
49:21
I have here is an average of values of a convex function, and I know that this is bigger than the value at the average. Okay. So I'm using convexity of f here to show that that guy is bigger than f of the average, more or less. And this is the last step
49:41
of this standard proof of convex optimization. But here if we look more closely to this step five and you just remove that term because I don't care about it and we just look at that guy here what you see is this is exactly the regret more or less the regret. Okay.
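In symbols, the quantity being pointed at in step five is roughly the following (a hedged reconstruction; the slide itself is not reproduced in this transcript):

\[
\frac{1}{T}\sum_{t=1}^{T}\Big(\mathbb{E}\big[f(x_{t-1})\big]-f(x)\Big)\;\le\;\text{(sum of small remainder terms)}.
\]

In classical convex optimization one then applies convexity (Jensen's inequality) to lower bound the left-hand average by f evaluated at the averaged iterate; for online optimization one simply stops here, because with the changing functions f_t in place of f the left-hand side is already the expected average regret.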
50:00
It says that, instead of having f, at stage t you have f t. Here you have the regret, here in that quantity. Okay. It's the average of the cumulative loss compared to the best average loss. Okay. So when you're doing convex optimization, typically, the last step of your proof is that you use the fact that the function is
50:32
convex. Here, instead of going to the last stage, you just stick to stage five. So for online optimization you can do the same proof as convex
50:40
optimization except that you're doing one step less. Okay. So that's why I said online optimization is simpler than convex optimization because you
51:01
know exactly that the regret arrives here in the proof, and it is not much more complicated to minimize the regret than to minimize your optimization error. And also it shows you that the proof is not that tricky. Thanks. So now
51:20
if, instead of, as before, strong convexity and constrained optimization, let's say we're looking at unconstrained optimization, then we're going to use the two-point meta-algorithm. So at each stage you're going to query x plus delta r t u t and x minus delta r t u t. You're doing the difference, and you add some
51:41
noise. If you pick the same set of parameters, more or less, and you're doing the uniform averaging, you get the same kind of results. So you get the same results in constrained and unconstrained optimization, except that for unconstrained optimization we need to be able to query each function twice at each stage,
52:00
so that we can kill the intercept. So here, it just says, I don't have time, but if my mapping is strongly convex we can improve this rate of convergence, but it's
52:20
not really relevant for now. So same thing: if we have only convexity and a constrained set, you can choose another set of parameters, we do get the results, I am not lying; and if you have convex and unconstrained, another set of parameters, just believe me or read the paper, but just believe me, I think it's quicker, and we do get
52:40
the rates of convergence. So here was the objective of the talk. So remember, the first three columns are what is already known. We know that if we want to do convex optimization with first order, noiseless, we have one over square root of t. If we add noise, we multiply it by d. If we have zeroth order, we multiply it by d.
53:00
So we guess, I think we not only want to guess that, but if we have zeroth order and noise, we should have the optimal rate of convergence in d squared over t. That should be the optimal rate of convergence, maybe on some natural problems. And what we get is that we do get those rates of convergence, but up to some
53:20
power term which goes to one as beta goes to infinity, which is the case we are interested in. If we look at mu strongly convex, same story: one over mu t, we get d over mu t, d over mu t, and then we think that the optimal rate of convergence is d squared over mu t, and we do get it up to some power which is not the same one, but it also goes to one as
53:40
beta goes to infinity. And this is true for convex optimization and as I said before I'm not lying this is also true for online optimization. It's just step five. We won't go to step six we should just stick with almost stick to step five. So you get the same results for
54:00
online optimization and convex optimization. And I think I'm on time so I will skip here so you have the objectives and there are many results on this slide. Thank you.
54:21
Are there some questions? So, in the conjecture there, yeah, so is this assuming any kind of smoothness, 2-smoothness, infinite smoothness, no smoothness? The conjecture, so it's a conjecture, so you can
54:40
assume, so it has to be mu strongly convex, I don't know. Actually, there's a lower bound for 2-smooth and mu strongly convex functions where actually there's a square root. No, even for zeroth order, the same
55:01
setting that you consider so stochastic zero order optimization strongly convex and smooth functions here you say yeah but I think if I remember correctly it's one of paper right yeah
55:22
So that's easier. So yeah, so the easy answer is to say, okay, it has to be smooth, but I'm pretty sure that this rate of convergence is not in that setting, actually, because here we are allowing, oh yeah, because we allow to
55:40
query outside the we can query outside the constraint set the lower bound holds even if the domain is unconstrained maybe we'll and
56:03
so now I don't remember why they are not the same but they are not actually they are not the same settings and you get you get yeah but in case you're right we're optimal in case you're wrong we have a good conjecture so yeah actually I'm slightly
56:20
concerned that the lower bound construction uses a function which is actually more than 2-smooth, yeah, because I think it's a quadratic, it's a quadratic function. But we had, we actually, we had this, no, no, no, because I do remember that we had this discussion in Singapore when you were giving a talk. I said, okay, this is not the same setting, but I don't remember
56:40
why it's not the same thing, but there is, so we could check offline, but this is actually not the same thing. And I think this is the idea, because here, if you look at convex optimization, then you can get this rate of convergence, okay, for convex optimization
57:02
with noise. Yeah, we'll check. But I agree that for bandits, this is, because, that's why I say this is not bandits, and I know why you picked this square root, because I know that if we are looking at bandits this would
57:20
be incorrect, okay? We're not doing bandit optimization; for bandits you would need the square root of t, I totally agree. And I think our setting is in between, I don't know, maybe the same setting as the lower bound one, okay. So, in the first line, your result
57:40
reads: if beta is smaller than two you get the right rate, and then on the right, the last one? So that's if beta equals two, the black one, and this is if beta is bigger than two. So that's when you don't have a convex f r delta? Yeah, but you can still prove it, because
58:01
the proof is the same, actually. So if I just go back to the proof, oops, sorry, here: f r delta is still convex, because we have a mu-strongly convex function. But here, you can see I'm just using the approximation of f by f delta, so it should be f r delta here at the end. But if
58:20
you use it at the beginning, then you can do the same proof with f instead of f r delta, and f is convex, okay? So you do the approximation at the beginning, and then you get an extra term that goes the other way around, but you can still do it, okay? So we're not doing gradient descent on a convex function, we're
58:40
doing gradient descent on a convex function with some noise. So somehow, if I understand where the magic comes from, is it that you query very far, so that essentially the noise has a small influence on the measurement that you make? Is that the idea,
59:01
is that the intuition? No, it's not. Actually, I'm pretty sure that if you choose the parameters correctly, as beta goes to infinity we get delta bounded, I think, I'm pretty sure. Well, not necessarily: as beta goes to infinity, yeah, delta can go to zero
59:20
very slowly, slower than the rates of convergence actually, but that's okay, so you can take delta to be bounded. Now the thing is, if your mappings are really, really smooth, what you get is this bias-variance trade-off, and if
59:40
your functions are really, really smooth, then even with delta bounded you can get an estimate of the gradient with a very small bias at a fixed variance, okay?
01:00:00
You can lower the other one instead, of course. But your bias-variance trade-off improves when beta improves. It's not really about the noise; it's really about the function you're optimizing. If it's really smooth, then you can get your variance
01:00:23
to be very small. You can lower the variance or the bias, of course; it's always the same trade-off. So you can interpret it as a variance reduction method? It's funny, because that was Alex's question last time.
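To make that trade-off concrete, here is a minimal numerical sketch in one dimension with plain finite differences (a simplification of mine, not the estimator from the talk): the smoothing step creates a bias, the query noise creates a variance that blows up like 1/delta^2, and a higher-order scheme, which needs more smoothness, already gets a small bias at a moderate delta. The test function exp, the noise level, and the stencils are all hypothetical choices.

```python
import numpy as np

# Zeroth-order gradient estimation in 1-D from noisy values F(x) = f(x) + noise.
# The bias comes from the Taylor remainder of the stencil, the variance from the
# noise amplified by 1/delta; more smoothness allows higher-order stencils with
# much smaller bias at the same delta.

f = lambda x: np.exp(x)      # hypothetical smooth test function, f'(0) = 1
sigma = 0.01                 # standard deviation of the query noise

def bias_two_point(delta, x=0.0):
    # exact bias of the central difference (f(x+d) - f(x-d)) / (2d), which is O(delta^2)
    return (f(x + delta) - f(x - delta)) / (2 * delta) - np.exp(x)

def bias_four_point(delta, x=0.0):
    # exact bias of the 4-point stencil, which is O(delta^4) but needs more smoothness
    est = (-f(x + 2 * delta) + 8 * f(x + delta) - 8 * f(x - delta) + f(x - 2 * delta)) / (12 * delta)
    return est - np.exp(x)

def noise_std_two_point(delta):
    # standard deviation contributed by the noise to the central difference, ~ sigma / delta
    return sigma * np.sqrt(2) / (2 * delta)

for delta in (0.5, 0.1, 0.02):
    print(f"delta={delta:4.2f}  bias 2-pt={bias_two_point(delta):9.2e}  "
          f"bias 4-pt={bias_four_point(delta):9.2e}  noise std={noise_std_two_point(delta):8.2e}")
# Small delta: small bias but large noise; large delta: the opposite.  A smoother
# function allows the higher-order stencil, i.e. a small bias already at a large delta.
```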
01:00:40
So yeah, I guess we could use a variance reduction method to improve those algorithms, actually, practically. It's a bias reduction method. Yes, that's a bias reduction method. Is everything in the first channel? Would it be more direct if you supposed that f lies in a Sobolev space?
01:01:03
Yeah, I mean, that's a good question, what we could do with this optimization. We could also, I think, with this kind of assumption, improve those rates of convergence.
01:01:22
For instance, if you assume some kinds of bounds, you can assume bounds on certain norms or things like that, and it could improve the rates of convergence. I think you're totally right: if you make another set of assumptions, like the one you mention, being in a Sobolev space, or even a finite-dimensional space, you can improve those rates of convergence
01:01:42
and those techniques, indeed. For me, the motivation was a bit unclear. You said that a lot of smooth functions are used in machine learning. But, for example, the first example that comes to mind is the sigmoid function, but it's not differentiable. And people use some regularizing tricks,
01:02:02
like convolution with functions with small support to regularize it. Sorry, I mean the step function, yes, yes. And then the maximum of the gradient is very high.
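A small numerical illustration of this point (hypothetical, not from the talk): convolving the step function with a compactly supported smooth bump of width h does give an infinitely differentiable function, but the maximum of its gradient grows roughly like 1/h, so the smoothness constants blow up as the approximation gets tighter.

```python
import numpy as np

# Mollify the step function by convolution with a C-infinity bump of width h
# (a hypothetical illustration: the smoothed function is infinitely smooth,
# but its maximal slope, i.e. the first smoothness constant, grows like 1/h).
def max_slope_of_mollified_step(h, n=40001):
    xs = np.linspace(-1.0, 1.0, n)
    step = (xs > 0).astype(float)
    inside = np.abs(xs) < h
    bump = np.zeros_like(xs)
    bump[inside] = np.exp(-1.0 / (1.0 - (xs[inside] / h) ** 2))   # standard mollifier
    bump /= bump.sum()                                            # discrete probability kernel
    smoothed = np.convolve(step, bump, mode="same")
    return np.max(np.gradient(smoothed, xs))

for h in (0.5, 0.1, 0.02):
    print(f"h={h:4.2f}   max slope ~ {max_slope_of_mollified_step(h):7.2f}")
# The approximation error to the step is confined to a window of width h,
# but the Lipschitz constant (and the higher derivatives even faster) blows up as h -> 0.
```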
01:02:24
I mean, if you do that, it will still be infinitely smooth, but all these constants M beta will tend to infinity. The question is, oh yeah, how fast is it going to infinity if you try to do that?
01:02:42
So my question is that usually these functions have some singularity in a small compact set. There are some results of Nemirovski on adaptive algorithms, when they work with singularities in a small compact set. So for me, it was a bit unclear:
01:03:01
was the motivation to use the maximum of the gradient, or the maximum of the beta-th derivative, in the bounds? I mean, like having a... I don't see your point. The logistic function is infinitely smooth with bounded,
01:03:20
with all derivatives bounded. So for me, the logistic is enough motivation. I mean, the constant is there, but it's very, very big. Yeah, but I mean, I think the correct question is: well, if I have a function I want to optimize which is not smooth, can I use an approximation of it which is infinitely smooth? I mean, can I find an approximation of it
01:03:42
which is infinitely smooth, and then use this kind of algorithm on the infinitely smooth function to optimize my original function? Is that what you mean, that you have a function at the beginning that's not smooth? I would rather ask what kind of example functions
01:04:00
you would apply this to. Well, then it's for the logistic function and all the functions involving the squared norm, typically; that's the main example. But yeah, I agree that if you have a function which is not smooth and you're trying to approximate it with a very smooth function, I'm not
01:04:23
sure that this is the best idea. But it's what Nesterov is doing with smooth optimization for non-smooth functions; it might be beneficial. Yeah, but my concern is, as you said, you have to control all your M beta.
01:04:42
I mean, all your derivatives are increasing. And I want to say that typically, but it's just a pure random guess, your M beta is going to increase so much that you're going to lose more in your approximation than you will gain from the rates of convergence. Okay, but that's a pure random guess.
01:05:01
One last question: what happens if you do not know beta in advance, or at least if you don't know M beta? So, if you don't know M beta, it's not that bad, because here we just use it: here we put M2 and M beta. But if you don't know it, just take it to be one, for instance.
01:05:21
And instead of having it inside this nice expression, raised to the nice power, it will just go outside with a one, or with a power which is not as nice, okay? So if you don't know M beta, it's okay, you just don't use it. If you don't know beta, then it's more tricky.
01:05:41
If you know beta, or rather if you know a lower bound on beta and use your lower bound, you will get the same kind of result with the lower bound. Because if you're beta-smooth, then you're also less-than-beta smooth, if you're on a bounded set, for instance. We always assume that you're two-smooth. So if you're two-smooth and beta-smooth for a large beta, but you only know for sure some lower
01:06:02
bound on beta, then you can use the lower bound, it's okay. Okay, because of the argument I gave before: if you're two-smooth and beta-smooth, then you're everything in between. If you have no idea, can we find an adaptive algorithm? I don't really know, okay? And since each query only gives you one point,
01:06:20
I'm not sure whether you can really estimate beta fast enough. I'm not really sure, okay? And I don't want to, well, I can make another pure random guess: I would say that no, you cannot estimate it, but again, it's a pure random guess, so I'm not sure how relevant it is. Okay?
01:06:41
OK, OK.