
When Moreau meets Langevin


Formal Metadata

Title
When Moreau meets Langevin
Series Title
Part
3
Number of Parts
6
Author
License
CC Attribution 3.0 Unported:
You may use, modify, copy, distribute and make the work or its content publicly available, in unchanged or modified form, for any legal purpose, provided that you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Release Year
Language

Content Metadata

Subject Area
Genre
Transcript: English (automatically generated)
This is joint work with a PhD student of mine. We are still at Télécom ParisTech — I moved last year to École Polytechnique — and it is about one part of my work, which is on high-dimensional sampling. It will have a very different flavour from the talk given by Jean-Claude: it is much more about statistics and probability, Bayesian statistics, with no algebra, no special functions, no nonlinear Fourier transforms — a completely different set of tools will be used. I will start by trying to motivate my talk, then go through some technical details — I am not sure I will cover everything — and then present some numerical illustrations and some conclusions. I will spend quite a lot of time on the type of application we have in mind, just to justify the type of techniques we are trying to implement right now. I am interested in computational statistics
with applications to machine learning. As you know, big data is there, and when you are looking at big data, people also start to look at much more complex models: if you collect a lot of data, you can try to model things that are more subtle. Instead of using the classical, very simple models that are typical in statistics — linear regression, logistic regression and the like — you can ask much more sophisticated questions and look at more detailed or complex models. When you have a complex model, you have a lot of parameters to fit, and since I am more of a Bayesian statistician, it is not only about fitting one value of the parameters but about investigating the posterior distribution of these parameters, which means that you need to be able to sample distributions over very large dimensional spaces. Of course there are tools to do this, and these tools have been used for quite a long time in statistical physics: for example, the famous Markov chain Monte Carlo techniques were among the first algorithms implemented on a computer, to solve statistical physics problems. But in fact there are special structures in statistical physics which are typically not there when you are doing Bayesian inference, where the models are much less structured in some sense, and so deriving techniques to sample these distributions efficiently is really an issue
which is at the heart of many, many important problems now. I am working on more or less four applications, which are the following — not exactly in that order. The first one I looked at is what is called the aggregation problem. In the aggregation problem, instead of doing what you typically do in machine learning — you collect data, which we call a training set; on this training set you estimate parameters; and once you have estimated the parameters, once you have trained your classifier or your predictor, you take some fresh data, a new data set, and you do the prediction — there is a better thing to do, and the better thing to do is aggregation of estimators. Aggregation of estimators means that, instead of picking from the training set the best parameter in some sense, you sample from some distribution which is constructed according to some precise rule (I will not go into that level of detail), and then, to do the prediction, instead of taking the one parameter value estimated on the training set, you average: you compute the conditional expectation of the predictor on the fresh data. So you aggregate answers, and this is called aggregation of experts. To do aggregation of experts you need to learn the proper aggregation rules, and the proper aggregation rule is a kind of posterior inference: it is sampling according to some posterior distribution over a large dimensional space. That is the first application. I started to look at this to solve what are called PAC-Bayesian problems, in the PhD thesis, five years ago, of Benjamin Guedj — he is now a researcher at Inria in Lille
and he is still working on this type of topic, aggregation. So that was the first motivation. The second motivation is to solve Bayesian linear inverse problems. Let me give you an example instead of going into the math. In a Bayesian linear inverse problem I was looking at, the data was pollution data: you have a boat, for example, on the sea, which injects some pollutant over some period of time; the pollution is not immediately detected, and you have satellites which collect monitoring images. What you are trying to do is reconstruction: you have a system which is quite complex — it is a PDE, a linear PDE in that case, with diffusion and transport terms — and you want to reconstruct the amount of pollutant which has been injected by the boat. So you really have to solve a kind of linear inverse problem from observations, there is a lot of uncertainty, and what you want is not just a point estimate but an assessment of all the uncertainty which is there. There are other examples, like Bayesian inference for high-dimensional systems — I will use such an example later on — and Bayesian nonparametrics, but I will not cover those aspects. All these problems, and the list is not exhaustive, share the same difficulty, which is sampling techniques for these types of models, which are not very well structured; and the techniques we have so far are not very adequate for doing this, in particular in high dimension. So now I will use a very simple example to let you know more precisely what I am looking at,
and then I will try to describe an algorithm for doing this. As an example illustrating the type of problems I am looking at, I will use Bayesian inference for a high-dimensional model — my first use case — and I start with one of the most commonly used problems in statistics and machine learning, the binary regression problem. What is a binary regression problem? You want to predict answers: the y_i are the answers you want to predict, and in binary regression these answers are yes or no. You also have a set of explanatory variables: the yes/no answers — say you answer yes or no to some task — are explained conditionally on some regressors, and typically these explanatory variables can be extremely high-dimensional. For example, you visit a website and you click or do not click on a given link: these are the binary variables, and you have a lot of explanatory variables from which you try to explain the answers. There are many, many problems of that sort — recommender systems, of course: there I said it is yes or no, but it can be 1 to 5, so I can make recommendations out of it, because I am analyzing the answers you give; I know a lot of things about you and I am trying to interpret the answers you have given. Here it is yes or no, but it can be other discrete data: instead of being binary it can be a multinomial variable, for example —
it is exactly the same thing. When you look at this binary regression problem, the most commonly used type of model is what is called a generalized linear model. In a generalized linear model you model the answers as follows: the y_i are Bernoulli variables with a success probability given by a certain function f — called the link function — of a linear combination of the explanatory variables. So the model is that you have a binary variable, and this binary variable is considered to be a Bernoulli random variable with success probability f of a linear combination of the variables. Of course you may imagine much more complex models — this is a very simple one. You may imagine, for example, interactions among factors, so instead of a simple linear term you have interaction terms and things like this; instead of a simple link function you can take something much more complex. But this is the basic example of a generalized linear model, and in one of the most widely used instances you take f to be the logistic function, the sigmoid, and you obtain the model called logistic regression.
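Written out, the generalized linear model with the logistic link just described is:

y_i \mid x_i \sim \mathrm{Bernoulli}\bigl(f(\beta^{\top} x_i)\bigr), \qquad f(t) = \frac{1}{1 + e^{-t}}.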
It is used in many, many applications in machine learning, and the problem is that the number of predictor variables can be extremely large — in many applications more than 10,000, sometimes 100,000, depending on the application; in a topic model, for example, it would be something like 10^5. This means you have to estimate parameters in a dimension of order 10,000 or even more; the data set can also be very large — I will not discuss that point today, but n can be very large too. So that is the setting. And I am in a Bayesian setting: in a Bayesian setting you do not want only to estimate a parameter,
which is what is typical in machine learning. In machine learning, you typically use the data set to fit the "best" parameter — the parameter beta which, in some sense, best explains the data you observed in your training set. In a Bayesian setting you have a more ambitious goal: to sample the posterior distribution of the parameter given the set of observations. In a Bayesian setting you have an additional piece of information: you put a prior distribution on the parameters, and the prior distribution is a way to describe, loosely speaking, the prior information that you have about the parameter before making any observation. The typical prior, for example, is the Gaussian prior: you put a Gaussian distribution as the prior distribution of the parameter. But you can also use a Laplace prior. A Gaussian prior is quadratic; a Laplace prior is simply an L1 term, and these two terms have very different behaviour. The Laplace prior, for example, is very useful — we will not discuss that point today, but it is at the heart of compressed sensing and things like this; it is the lasso. Perhaps you have looked at recent work in statistics: L1 priors are extremely popular now because they favor sparse solutions, which is important. And what you are doing in Bayesian statistics — what you want to sample from — is
the posterior distribution of the parameters given the observations. You simply apply the Bayes rule. When you apply the Bayes rule, you have the prior distribution of the parameters, pi(beta); then you have the conditional distribution of the observations given the parameters — and here we have assumed that y_i is a Bernoulli random variable with success probability f(beta^T x_i) — which is called the likelihood term. The posterior distribution of the parameter given the observations is the product of the prior times the likelihood and, of course, to complete this you need to renormalize. One of the difficulties in Bayesian statistics in general is that it is impossible to obtain a closed-form expression for this normalizing term: you would have to integrate this quantity with respect to beta, but remember that beta is in dimension 10,000 or something like this, so of course it is impossible to compute the normalizing factor. So you know the posterior distribution only up to a normalizing constant.
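Concretely, with pi_0 denoting the prior, the Bayes rule just described reads (Z is the intractable normalizing constant):

\pi(\beta \mid y_{1:n}) = \frac{1}{Z}\,\pi_0(\beta)\prod_{i=1}^{n} f(\beta^{\top}x_i)^{y_i}\bigl(1-f(\beta^{\top}x_i)\bigr)^{1-y_i}, \qquad Z = \int_{\mathbb{R}^d} \pi_0(\beta)\prod_{i=1}^{n} f(\beta^{\top}x_i)^{y_i}\bigl(1-f(\beta^{\top}x_i)\bigr)^{1-y_i}\,d\beta.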
As I said, there are a lot of works in computational statistics on this logistic regression model, and the techniques used are typically called data augmentation. We will not discuss this today — if you are not an expert in MCMC, it is not really needed to understand what the data augmentation business is here. The only thing you need to know is that, up to now, the convergence time of these techniques is prohibitive as soon as the dimension is larger than 100. So there is at least one or two orders of magnitude between that and the problems we want to attack, whose number of parameters is of order 10,000 or more, whereas the techniques known today are not appropriate there — 100 is really the limit, for example, for the Pólya-Gamma sampler; it is much too complex otherwise. So it has been considered a daunting problem,
but in fact I do not think it is a daunting problem. I think the techniques that have been used are simply not appropriate, because if you look at link prediction — logistic regression is behind most of those techniques; it is one of the workhorses of Google and Facebook for link prediction, and they really use regression over large dimensional spaces — there is in fact a lot of structure in the problem. If you look at the problem now,
if you look at the sampling problem: you are trying to sample the distribution which is exponential of minus U(beta), where U(beta), given there, is a potential — a Gibbs potential if you wish. The Gibbs potential has two terms: a term which is the log-likelihood of the observations — and this term is convex; it is not completely obvious, but it is convex in this case, at least when f is the logistic function — and a penalty term, which is either the 1-norm or the squared 2-norm of beta, possibly transformed by some design matrix called B. In terms of optimization, if you simply looked at the optimization problem, it can be a strongly convex potential. So if you simply wanted to optimize with respect to the parameter beta, you would have to minimize a convex function in large dimension, which is doable; even if you take the 1-norm there, it is a non-smooth convex optimization problem, but you have tools for that. So in terms of optimization it is a rather simple problem, and my guess was that in terms of sampling it should not be a very difficult problem either,
and I think it is not a very difficult problem if you use the proper tools. I will start with what I call the smooth case. In the smooth case I assume that the potential is differentiable, which means that I first look at the case where the term coming from the prior is differentiable; afterwards I will explain, more briefly, how to use this machinery when the prior distribution is not differentiable — then you need to do some smoothing in order to apply it. Let me change the notation a little. The problem is now to sample the distribution given by exp(-U(x)), where U(x) is a Gibbs potential; you typically do not know the normalizing constant, and the dimension d is very large. What type of smoothness have I assumed? I assume that U is L-smooth, meaning that U is continuously differentiable and that the gradient of U is Lipschitz with constant L — which is in fact computable, and that is very important. One consequence of this assumption is that the gradient of U grows at most linearly at infinity — it does not increase faster than the norm of x — which is a consequence of being
globally Lipschitz; the global Lipschitz assumption is important there. If you look now at continuous time, you have a diffusion. I can consider this diffusion, which was introduced by a French physicist called Paul Langevin — one of the founders of modern physics — to describe the behaviour of a particle in a liquid. He proposed this equation, which is a diffusion; if you are not familiar with diffusions, it is a kind of extension of an ordinary differential equation: an ordinary differential equation to which, loosely speaking, you add a Brownian term — B_t here is a Brownian motion — and this is called an SDE. What you can prove, precisely because you have assumed that the gradient of U is globally Lipschitz, is that this diffusion has a unique strong solution. Under this assumption you can also prove that it has a unique invariant distribution: the dynamics is reversible, and the unique invariant distribution is exactly exp(-U) — precisely the distribution you want to sample from.
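In the notation used throughout the talk, the Langevin SDE and its invariant measure are:

dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dB_t, \qquad \pi(dx) \propto e^{-U(x)}\,dx.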
You may even prove that convergence to the stationary distribution takes place at a geometric rate. Typically this means that if you were able to sample this diffusion exactly — you start from some point, and you run the diffusion exactly — then after a certain amount of time you would be close to the distribution you want to sample, and you would control the closeness exactly, because there are extremely precise estimates of the rate at which you converge to the stationary distribution. You can use functional inequalities — which have been used, for example, by Cédric Villani recently — like Poincaré or log-Sobolev or transport inequalities, based on total variation or relative entropy techniques; or coupling techniques, which are more probabilistic. These two types of techniques give extremely precise estimates of the rate at which the diffusion converges to pi. So conceptually this is very interesting,
in the sense that it gives you a way to sample pi. But there is a problem: you would need to sample exactly from the stationary distribution of a diffusion, which is of course not possible, because up to now we do not have computers working in continuous time — we can only discretize. So an obvious way to approach the problem is discretization. To discretize this diffusion, the simplest scheme is the Euler-Maruyama scheme: the Euler scheme for an SDE is simply the analogue of the Euler scheme for an ordinary differential equation. What you do is fix a step size, gamma_k — the step size of the discretization is not necessarily constant, which is why I put the index k there, and we will see later that it is important not to take a fixed step size. At each iteration you compute the next point x_{k+1}: you start from x_k, you compute the gradient of U at x_k, take a step, and then you add some noise — sqrt(2 gamma_{k+1}) Z_{k+1}, where Z_{k+1} is a standard Gaussian, because it represents an increment of the Brownian motion, whose variance is proportional to the step size. This has a very familiar form: if you forget the additional noise term, you simply have a gradient descent. So the Euler-Maruyama scheme is a kind of gradient algorithm where, at each gradient step, you add an amount of noise — an amount proportional to the step size you have performed. It is very simple to implement, more or less one line of code, as soon as you are able to compute the gradient.
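A minimal sketch of the scheme just described — the unadjusted Langevin iteration — assuming grad_U is a function returning the gradient of U (the names and the example target are mine, not the speaker's):

```python
import numpy as np

def ula_chain(grad_U, x0, gammas, rng=None):
    """Euler-Maruyama discretization of the Langevin SDE:
    x_{k+1} = x_k - gamma_{k+1} * grad_U(x_k) + sqrt(2 * gamma_{k+1}) * Z_{k+1}."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    chain = [x.copy()]
    for gamma in gammas:
        noise = rng.standard_normal(x.shape)
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * noise
        chain.append(x.copy())
    return np.array(chain)

# Example: sampling a standard Gaussian (U(x) = |x|^2 / 2, so grad U(x) = x)
# with a decreasing step-size schedule gamma_k = 0.1 / sqrt(k).
gammas = [0.1 / np.sqrt(k) for k in range(1, 10_001)]
samples = ula_chain(lambda x: x, np.zeros(2), gammas)
```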
Now look at the case of a fixed step size. If the step is fixed — if you take gamma_k equal to gamma in the previous equation — then this recursion is simply a homogeneous Markov chain (otherwise, with non-constant gamma_k, it is time-inhomogeneous). So with gamma_k equal to gamma you obtain a Markov chain, and you can of course write the transition probability density of this Markov chain, R_gamma(x, y), which is given there. What you can prove, under appropriate conditions — you need a bit more than the gradient being Lipschitz; I will give the precise condition later, but you need some kind of curvature at infinity — is that the Markov chain is irreducible (this is a continuous state space Markov chain, so irreducibility is not defined exactly as for discrete state spaces; it is a bit more involved) and positive recurrent, meaning it has a single invariant distribution. This invariant distribution of course depends on gamma, and the problem is that you have lost your target: the target distribution is pi, but because you have discretized, the chain targets a distribution pi_gamma, and pi_gamma is not equal to pi. That is our problem: with this discretization you obtain, at the end of the day, an algorithm which samples an invariant distribution — it may even converge geometrically fast to that invariant distribution — but as soon as you take a fixed step size gamma, the stationary distribution pi_gamma is not equal to pi. Just a word here:
when you look at this, it is important to notice that the gradient of U at x_k should grow at most linearly at infinity — this matters for stability, because imagine that the gradient of U grows like |x|^2 when x becomes large: the scheme will typically not be stable. The stability conditions in discrete time and continuous time are completely different. In continuous time — this is well known in physics — you may have a steep gradient of U, and a steep gradient gives a kind of confinement of the solution; whereas when you discretize, steep potentials are very bad. In that case it is no longer possible to use the Euler scheme, and you need more sophisticated schemes, called implicit schemes, in order to preserve irreducibility and stability. Okay, so there is one way out — though I will argue it is not necessary in our case — which was proposed in
1996 by Roberts and Tweedie. Roberts and Tweedie said: okay, what I know is that if I take gamma_k equal to gamma I get a Markov chain which does not have exactly the right stationary distribution; but there is a technique, due to Metropolis and later refined by Hastings, by which, starting from a Markov chain, you can correct its stationary distribution by adding a kind of accept-reject scheme. It works as follows: at each iteration of the algorithm you propose according to this distribution — according to the Euler step you obtain a proposal y_{k+1} — and then you accept or reject this move with a certain probability, which is given there. There are pros and cons. The pro is that you get the right stationary distribution: because of the accept-reject step, the algorithm becomes reversible with respect to the target distribution pi, and not pi_gamma, which is good. But there are cons. It requires evaluating the objective function twice per iteration, which is sometimes unnecessary. There is a problem linked with convergence — we will not discuss it, but the relevant conditions are very difficult to check, and even in simple geometries it is not completely clear that the Metropolis correction preserves geometric ergodicity. And third — the difficult problem — scaling analysis suggests that adding the accept-reject step forces you to use very small step sizes, so the rate at which you converge to stationarity is much smaller than the rate you get with the technique I will explain a bit later.
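For concreteness, a minimal sketch of the Metropolis-adjusted step just described (MALA); log_pi and grad_U are assumed to be callables for log pi up to an additive constant and for the gradient of U — the names are mine:

```python
import numpy as np

def mala_step(x, gamma, log_pi, grad_U, rng):
    """One Metropolis-adjusted Langevin step: propose an Euler-Maruyama move,
    then accept or reject it so that the chain targets pi exactly."""
    def log_q(y, x):  # log density (up to a constant) of the Gaussian proposal y | x
        mean = x - gamma * grad_U(x)
        return -np.sum((y - mean) ** 2) / (4.0 * gamma)

    y = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
    # Metropolis-Hastings log acceptance ratio; proposal constants cancel.
    log_alpha = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
    return y if np.log(rng.uniform()) < log_alpha else x
```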
Okay, so now I will try to explain what we are suggesting, and we will see that we can tweak the discretization a little in order to obtain the right stationary distribution without having to use the MALA correction. I need four results to do this. First, I need a stability condition for the diffusion — a condition guaranteeing, in some sense, that the diffusion converges geometrically fast to its stationary distribution. Starting with the diffusion, I look at its generator, and in order to have geometric convergence to the stationary distribution — this is more or less an if-and-only-if condition — I need to find a Lyapunov function for the diffusion, and
the Lyapunov function for the diffusion is a function V which satisfies this condition. The calligraphic A here is the generator of the diffusion: the generator of a Langevin diffusion, applied to f, is the Laplacian of f minus the scalar product of the gradient of U with the gradient of f. I need to find a function V satisfying what is called the Foster-Lyapunov condition — Lyapunov did not invent that, of course; it was Foster who understood it for discrete Markov chains, and it was later extended to diffusions. This condition is more or less if-and-only-if: if you want geometric convergence in total variation to the stationary distribution, you more or less need to find a function V satisfying this condition.
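The generator and, presumably, the drift condition on the slide take the standard form (the constants theta, b and the set C here are my placeholders):

\mathcal{A}f = \Delta f - \langle \nabla U, \nabla f \rangle, \qquad \mathcal{A}V \le -\theta V + b\,\mathbf{1}_C, \quad \theta > 0,\ b < \infty.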
As an example — as I said, you need a little bit more than a Lipschitz condition; what you typically need is some curvature at infinity, and these are the corresponding conditions — take V(x) = exp(U(x)/2). When you look at Markov processes it is often the case that — remembering that pi(x), the distribution I want to sample, is exp(-U(x)) — taking V(x) to be pi(x) to a negative power, here pi(x)^{-1/2}, the stationary distribution to a negative power, gives a Lyapunov function. To check it you need a little more at infinity: you need the gradient of U to have some curvature at infinity. If you have that, you get exponential convergence to stationarity: you can prove that, starting from a distribution mu_0 satisfying some constraints — P_t is the semigroup associated with the diffusion, pi is the stationary distribution, the target distribution — you can control
in that case the difference in total variation between mu_0 transported by the semigroup of the diffusion, mu_0 P_t, and pi, which is less than C(mu_0) kappa^t. You have a completely explicit expression — which I will not give — depending on the type of technique you use: you may use functional inequalities like Poincaré or log-Sobolev inequalities, or different types of couplings, synchronous couplings or reflection couplings; there are different techniques for obtaining a completely explicit expression for C(mu_0) and for the exponent kappa. At this level of understanding, you simply need to know that the Langevin diffusion will converge to pi exponentially fast and that you have precise control of
the rate at which it converges to stationarity and of the way it depends on the initial distribution mu_0. Now, if I look at the discretization of the diffusion, I need to prove the same type of result. As I said, it is a Markov chain — time-inhomogeneous if the step sizes are not constant — and the condition I need for R_gamma is also called a Foster-Lyapunov condition, this time in discrete time. The Foster-Lyapunov condition I need is to find a function V — it can be the same function V as for the diffusion — such that R_gamma V, the transition kernel of the Markov chain applied to the function V, is less than kappa^gamma times V, plus gamma times b. This is not a classical Foster-Lyapunov condition, because now it depends on gamma — which makes complete sense, because if gamma is taken to zero, R_0 means that the Markov chain is completely trivial: x_{k+1} = x_k, so you do not mix at all. If gamma is very small you will not mix much, and this is reflected there: with gamma equal to zero you have R_0 V equal to V, which is as it should be; and if you take a larger gamma, kappa^gamma will typically decrease, which means that you contract V more. What you can then prove is the same type of result: the iterates of this Markov kernel converge to pi_gamma — in the V-norm; it can be the total variation norm, but that is a detail — at a rate which is this kappa^gamma. So it is a Foster-Lyapunov condition, but you need a little more: it should hold more or less uniformly over gamma, for gamma in a range below some gamma_0, with this type of behaviour. If you take exactly the same conditions as before — U is L-smooth and you have a tail condition of that form — you will obtain
exactly this type of condition; and if I add further conditions on U — if I assume, for example, that U is strongly convex at infinity — I get better constants. The way you evaluate kappa and b depends on the type of assumptions you are ready to make, right? If you assume that U is convex, or strongly convex outside a ball, or strongly convex outside a ball with bounded oscillation within the ball, you obtain different evaluations of kappa and b. But the weakest condition, which leads to the worst bound, is some kind of curvature at infinity. Starting from this, what you can prove is a control of the moments of V. The notation Q_gamma here — I think I did not define it —
is simply the following: R_gamma is the transition kernel of the discretization, the gamma_k are the step sizes, and Q^n_gamma is the composition — you apply R_{gamma_n} first, then R_{gamma_{n-1}}, and so on down to R_{gamma_1}. If you assume a Foster-Lyapunov condition of this kind, what you can prove is that Q^n_gamma V(x) — which is the expectation of V(x_n) for the discretization started at x — is less than kappa to the power of the sum of the gamma_k, times V(x), plus this second term. So the Foster-Lyapunov condition I imposed on the discretization gives a way to control the V-moments of the discretization in that form.
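Schematically, the discrete-time drift condition and the moment control it yields read as follows (I write the additive constant loosely as C(b, kappa); the talk has the explicit form):

R_\gamma V \le \kappa^{\gamma} V + b\gamma \ \ (0 < \gamma \le \gamma_0) \quad \Longrightarrow \quad Q^{n}_{\gamma}V(x) \le \kappa^{\sum_{k=1}^{n}\gamma_k}\,V(x) + C(b, \kappa).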
Okay, so now I will explain what I suggest, using these results, to obtain a very simple technique for sampling the stationary distribution. The idea is very, very simple. What I want to control is this: mu_0 is the initial distribution, Q^p_gamma is the semigroup of the discretization, pi is the stationary distribution, TV is the total variation, and I will try to control this difference. What I know is that pi is the stationary distribution of the diffusion. Let me just explain the principle — it is not the way to obtain the tightest bound, but it is a way to explain why it works. I cut this difference into two terms. The first term is written there: Q^p_gamma satisfies a kind of Chapman-Kolmogorov relation, so you take any n, and Q^p_gamma is the product of Q^n_gamma and Q^{n+1,p}_gamma, with obvious notation — simply a product of kernels. Here this is exactly equal to Q^p_gamma; then you replace Q^{n+1,p}_gamma by the semigroup of the diffusion, run for the time gamma^{n+1,p} written there; you add and subtract, the cross terms vanish, and you are left with these two terms. Now look at these two terms. The first one is easy to control. Why? Because it is exactly the semigroup of the diffusion, run during a period of time gamma^{n+1,p}, with initial distribution mu_0 Q^n_gamma — the initial distribution transported by the discretization — and you look at the rate at which it converges to pi. So this term is simply controlled by the rate at which the diffusion converges to its stationary distribution, for which you have extremely precise estimates, starting this time from mu_0 Q^n_gamma. Then you have a second term to control,
and what does this second term control? You start now at time gamma^n — you start with the same initial distribution, mu_0 Q^n_gamma — and you compare what happens when you transport this measure with the semigroup of the diffusion versus the semigroup associated with the discretization of the diffusion. That is exactly this term. For the first term, of course, we already have the right tools; now we need a technique to control this second term. What you have to do is in fact very simple; you can reinterpret it as follows. Take two diffusions driven by the same Brownian motion, evolving with different drifts: the first diffusion evolves according to the Langevin drift, the gradient of U, and the second evolves according to the discretized Langevin drift, the interpolated gradient written there — the discretization of the Langevin drift — both driven by the same Brownian motion. What you can prove — it is a simple trick —
is that the laws of these two diffusions on the path space are mutually absolutely continuous — because you have the same Brownian term here, the law of the diffusion Y_t is absolutely continuous with respect to the law of the discretized diffusion — and there is a famous theorem in stochastic calculus, due to Girsanov, which gives an exact expression for the Radon-Nikodym derivative of the law of the diffusion on the path space with respect to the law of the discretized diffusion. The Girsanov theorem tells you that you can compute, on the path space, the density of the law of the path with respect to the law of the path of the discretized diffusion, and it is exactly given by this term. That gives you the expression, more or less without the technical details; the thing to retain is that this quantity is the Radon-Nikodym derivative of the law of the diffusion
on the path space with respect to the law of the discretized diffusion. So now you have a density, and you want to control the total variation distance, starting from some distribution — say, for example, a Dirac mass delta_x. What you do is simply apply the Pinsker inequality, famous in information theory: the Pinsker inequality tells you that you can control the total variation distance between two distributions by the Kullback-Leibler divergence — the relative entropy — between the law of the diffusion and the law of the discretized diffusion. There are generalizations; you can do better than Pinsker, but this is the simplest way, and it is a good, low-cost control: the inequality is not necessarily very tight, but at low cost it is a good inequality (otherwise you can use other f-divergences in place of the entropy). In our case it makes things extremely simple, because if you plug the expression of this density ratio into the Pinsker inequality, you obtain exactly this: you simply have to control the second moment, with respect to the law of the discretized diffusion, of the difference between the gradient of U and the discretized gradient of U. So you have an essentially closed-form expression for the discrepancy between the diffusion and its discretization, and the good thing is that you simply have to control these moments — moments taken with respect to the law of the discretization — which is quite easy if you have in mind that, thanks to the drift inequality, you are able to control moments of exactly this type.
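Schematically, writing mu for the path law of the Langevin diffusion and mu-bar for that of its interpolated discretization on [0, T] (my notation), the two ingredients combine as:

\|\mu - \bar\mu\|_{\mathrm{TV}}^2 \le \tfrac{1}{2}\,\mathrm{KL}(\bar\mu \,\|\, \mu), \qquad \mathrm{KL}(\bar\mu \,\|\, \mu) = \tfrac{1}{4}\int_0^T \mathbb{E}\bigl\|\nabla U(\bar Y_s) - \overline{\nabla U}(\bar Y_s)\bigr\|^2\,ds.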
At the end of the day, you plug in the moment control we had before and you obtain a completely explicit bound on this quantity. What quantities appear in this bound? There is d, the dimension of the space — so you have an explicit dependence on the dimension; there is an explicit expression in terms of the gamma_k; and there is L, of course, because of the difference term — L is the Lipschitz constant; everything I have done works because I assumed that U is L-smooth. If I put everything together, I get an explicit bound which looks like that, where n is a cut-point I can choose: the first part reflects the control of the diffusion's convergence, and the second part the control of this difference there. I did not show it on the slide, but you have a completely explicit expression for this quantity too: all these bounds are quantitative and can be computed. For example — I did not discuss this, but it is important in our context — you have an explicit control of the dimension dependence.
So we have explicit control of the dimension. And if I assume that gamma_k goes to zero — with some additional assumptions there, but basically at almost any rate — in such a way that the sum of the gamma_k goes to infinity, what I can prove is that mu_0 Q^p_gamma minus pi, in total variation, goes to zero as p goes to infinity. So what I have proven — that is the message I want to convey; the proof is quite easy — is that there is a way, and it is a bit surprising. I have a diffusion with the right stationary distribution, pi. If I use the Euler discretization with a fixed step size, I end up with a Markov chain which does not have the right stationary distribution. But if I use decreasing step sizes, provided the sum of the gamma_k goes to infinity, my discretization still converges to pi, and I can control exactly the rate at which I converge, because everything is explicit.
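In symbols, the step-size condition just stated is:

\gamma_k \downarrow 0, \qquad \sum_{k \ge 1} \gamma_k = \infty \quad \Longrightarrow \quad \bigl\|\mu_0 Q^{p}_{\gamma} - \pi\bigr\|_{\mathrm{TV}} \xrightarrow{\;p \to \infty\;} 0.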
I sometimes have trouble convincing my colleagues, because it is not completely obvious why, with a fixed step size, you never converge to the right distribution, whereas with decreasing step sizes you do converge — and on top of that you have explicit control of all these quantities. Of course, there is a huge literature on the discretization of Langevin diffusions — I should have mentioned that — but where we are a bit more careful than other authors is that we have tried to control explicitly how all these bounds depend on the dimension d, because we have in mind sampling over a large-dimensional space. There are also further results using this type of technique: you can see, for example, how pi and pi_gamma are related, but I will not explain that. I should also have mentioned that many results of this type were obtained a long time ago by Denis Talay, with different applications in mind — the discretization of stochastic differential equations, not necessarily for simulation purposes. He did a lot in this area, and in particular he obtained results of this type for even more general diffusions,
but not the result stated before. Okay — and this is more or less the end of my talk — what is essential in what I said before? There are not many assumptions, but there is one which is important, even for the diffusion to have a unique solution: the potential U was assumed differentiable. This type of result is of course not applicable when U is not differentiable, which is exactly what happens in a typical machine learning situation: you want to sample p(x) = exp(-U(x)), and in machine learning there are a lot of cases where the potential is U(x) = F(x) + G(x), with F a nice function and G convex but nasty — convex but non-regular. There are many, many examples of this. In the famous (or infamous) lasso-type techniques, G(x) is the L1 norm: convex but non-differentiable. In support vector machines, introduced by Vapnik a long time ago, the loss is what is called the hinge loss — a hinge function, which is very popular now because hinge-like functions are also used in deep learning — and it is not differentiable at one point. So there are many problems where you have this kind of composite potential, with F smooth and G convex but not smooth, and it is an important problem to have a solution for this too. This is exactly what I will describe now. What I will use is
something closely related to what people do in optimization, named after Moreau, a French mathematician who invented it in the 60s. He introduced a kind of regularization of a function which, in this case, is the right way to do the regularization: the Moreau envelope. The Moreau envelope is defined as follows. You take a function H — assume that H is convex, for example; it is not strictly necessary, but assume it. The Moreau envelope H_lambda(x), where lambda is a regularization parameter, is obtained by taking H(y), adding the regularization term |x - y|^2 / (2 lambda), and minimizing over y. The envelope is always below H, of course, and it converges to H as lambda goes to zero. In the case where H is convex, the function y -> H(y) + |x - y|^2 / (2 lambda) is strongly convex, so it has a single minimizer, and this is how you may define H_lambda(x), the Moreau envelope (otherwise the definition is a bit more involved, but that is not needed at this point). The point at which the minimum is achieved is called the proximal point; it depends on H and on x, and it is characterized by this quantity — a subgradient condition — so you have a closed-form characterization.
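In formulas, the Moreau envelope, the proximal point, and the gradient formula mentioned below are:

H_\lambda(x) = \min_{y}\Bigl\{ H(y) + \frac{\|x-y\|^2}{2\lambda} \Bigr\}, \qquad \operatorname{prox}_{\lambda H}(x) = \operatorname*{arg\,min}_{y}\Bigl\{ H(y) + \frac{\|x-y\|^2}{2\lambda} \Bigr\}, \qquad \nabla H_\lambda(x) = \frac{x - \operatorname{prox}_{\lambda H}(x)}{\lambda}.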
The important thing to know at this point is that for many functions h, the proximal operator is known in closed form, and sometimes it is extremely simple. If h(x) is, for example, the sum of |x_i| for i from 1 to n — the lasso penalty — then prox_{lambda h}(x) is simply the coordinate-wise soft-thresholding operator. What is soft thresholding? It is simply this function: between minus lambda and lambda it is exactly 0, and outside it is x minus lambda or x plus lambda.
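A minimal sketch of that coordinate-wise soft-thresholding prox (standard; written in my own notation):

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam * ||x||_1: coordinate-wise soft thresholding.
    Returns 0 where |x_i| <= lam, and shrinks x_i towards 0 by lam otherwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```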
So the proximal operator for this function is completely simple — it is just this. And it is related, of course, although it was not first noticed in these terms: if you have followed what has happened in statistics over the last 30 years, nonlinear thresholding was one of the main themes, with people doing functional approximation with wavelets and things like this. So this is related to the lasso, related to L1, to compressed sensing — everything is related, through the fact that L1 is always behind it and the proximal operator of L1 is the soft-thresholding operator. When you rewrite the story, you can reinterpret everything in this way, but of course it was not first noticed like that. Okay. So the things to know are these: the regularized H_lambda converges to H; H_lambda is smooth; and there is a way to construct its gradient, which can be computed like this — so it is very easy in such cases.
So to make a long story short, so what you do, what we are proposing in a paper, so with Alain and Marcelo, is that we are, what we are doing is that we are taking our potential u of x, and what you do is that we take f of x, and then we regularize the convex-ponesty part, G lambda, and we use the Mauro regularization.
Okay, what you can prove — it is a theorem, not very difficult to prove, though not completely easy either — is that if exp(−g(x)) is integrable, then exp(−g_lambda(x)) is also integrable, so the regularized potential still defines a proper distribution. And we have proper control: we can prove that pi_lambda, the distribution associated with this new potential, converges to pi as lambda goes to zero, with an explicit and simple bound in total variation between pi_lambda and pi. Forget the third term in the bound; it can be used to solve another problem, something from convex bodies, but that is another issue. And the algorithm, which is called MYULA, for Moreau-Yosida Unadjusted Langevin Algorithm, is simply this: you run, with a decreasing step size, the Langevin discretization of the potential that has been regularized using the Moreau envelope. At each step you compute the gradient of f at x_k, plus the gradient of the non-smooth part after regularization, which is given by the proximal term. That is the explicit discretization of the diffusion that you are using, with a decreasing step size.
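A minimal sketch of the iteration, assuming NumPy; grad_f, prox_g, and the step-size schedule below are illustrative placeholders, not the exact choices of the paper:

```python
import numpy as np

def myula(grad_f, prox_g, x0, lam, n_iter=10_000, gamma0=1e-2, rng=None):
    """Sketch of MYULA: unadjusted Langevin discretization of the
    Moreau-Yosida regularized potential U_lambda = f + g_lambda."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    samples = []
    for k in range(1, n_iter + 1):
        gamma = gamma0 / np.sqrt(k)              # one possible decreasing step size
        grad_g_lam = (x - prox_g(x, lam)) / lam  # gradient of the Moreau envelope of g
        noise = np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
        x = x - gamma * (grad_f(x) + grad_g_lam) + noise
        samples.append(x.copy())
    return np.asarray(samples)
```

For instance, with grad_f coming from a Gaussian likelihood and prox_g the soft-thresholding operator above, this would sample approximately from a posterior with a Laplace prior.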
Just a small illustration — the one given there is a simple one, but it is easy to grasp; I have chosen it because it is easy to understand. The problem is very classical: it is a Bayesian inverse problem in very high dimension. Here x is an image, observed through an operator H called a blur operator — a kind of convolution operator. H is known in this case, and the problem is ill-posed in the sense that H is extremely low-pass, which is also typical. You add some amount of noise w, so what you observe is an image y: the given object x, the image you are willing to recover, observed through a point spread function H, plus some amount of noise, which is assumed to be Gaussian.
And what I am willing to do is to restore x. As I said, it is a classical problem, but at least you can see something. You want to restore x, and you add a prior, and the prior information used there is given by this term: here Δ_d is called the discrete gradient operator. What you are willing to favor is the following prior information about natural images — this is called total variation regularization. An image is a lattice of pixels, and your prior information is that if you take the discrete gradient — the differences between pixels, row-wise and column-wise — then in a natural image the sum of these differences should be small. When you look at an image like this, you have a lot of more or less homogeneous areas separated by contours, so you expect the sum of the absolute differences of the gradient to be small, which is true in natural images, okay. So what you want to sample is the posterior distribution: y are the pixels you have observed, x are the pixels of the true image. You have a term which is simply
the likelihood of the observation under this model, plus a term which carries the prior information on your image, which is given there.
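Putting the two terms together, the target posterior on the slide has — up to the notation chosen here, with σ² the noise variance and α > 0 the regularization weight, both assumptions of this write-up — the form

$$
\pi(x \mid y) \;\propto\; \exp\!\Big( -\tfrac{1}{2\sigma^{2}}\,\|y - Hx\|^{2} \;-\; \alpha\,\|\Delta_{d}\, x\|_{1} \Big),
$$

where the first term is smooth (this is f) and the total-variation prior term is convex but non-smooth (this is g).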
Of course, it is always a bit difficult to see on slides, but here is the true image. The number of pixels to restore is of the order of 10^5, so I am in high dimension. Next is the blurred image — blurred because of the blurring operator H — with some amount of noise added, and then the MAP estimate. And what I have obtained there are credibility intervals; it is about the only thing I can display. What are credibility intervals? Now that I have sampled the posterior π(x | y), where x is a large vector of pixels, I display, pixel-wise, the zone containing 90% of the posterior distribution. And you can see the gain: with the previously standard proximal technique you spend more than one day of computation to obtain these credibility intervals, whereas with our technique it is something like 20 minutes, which is a bit better.
Okay, so there is a lot of work that remains to be done, and some work already done that I did not present: we have computed bounds for the convergence — that is done — and we can compute the MSE of estimates and deviation inequalities, so we have results in these directions. For future work, something that has not been done yet: partial updates. When you are in very large dimension, using a full gradient is sometimes not the right way to go; it is much better to update the parameters group by group, but it is not that easy to do proofs with partial updates, so perhaps we will do that later. We also have to work on including more complex image-processing priors, and on a more detailed comparison with MALA, trying to understand why they are so different —
we observe a lot of difference, our method works much better, but we need to explain why. And we also have to look at bias reduction techniques to speed up the convergence even further. This is reminiscent of something that has been done for SDEs to remove the discretization bias, the Richardson-Romberg technique: the idea is to use the Euler discretization with two different step sizes and then recombine the estimates. It is well known how to do that when you estimate, for example, a smooth functional of the diffusion, but it is less clear how to implement it when you are trying to sample the stationary distribution of the diffusion. We have some results already, not explained here.
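The recombination idea, in its simplest form: if π̂_γ(φ) denotes the estimate of E_π[φ] obtained from the Euler scheme with step size γ, one forms

$$
\hat{\pi}^{\mathrm{RR}}(\varphi) \;=\; 2\,\hat{\pi}_{\gamma/2}(\varphi) \;-\; \hat{\pi}_{\gamma}(\varphi),
$$

which cancels the first-order term of the discretization bias. This is a sketch of the principle, not the precise estimator alluded to in the talk.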
Thanks for your attention, and I hope there will be some questions. Okay, so are there any questions? Okay, we have a question. I have the feeling that in the continuous case things are quite clear, and whenever you start discretizing, the gates are open and everybody tries different techniques on how to discretize and mix. Is that the case? Yes, the difficulty comes when you leave continuous time. The theory in continuous time is very elegant, because you have extremely precise results on the rate of convergence and because there is a natural language, PDEs, in which you can say a lot of things; in discrete time, of course, it is more complex. So my question, put differently, is basically this: the biggest problem is how to
implement a continuous SDE on a computer. So do you think there are other ways, in the future, of going directly to an analog computer? Why not? Because I have the feeling it is the digital world which poses all these problems for you. Yeah — if you look at what was done in the 60s, people were using analog computers; analog computing was used, for example, to solve systems of differential equations, so it is doable. Costly, but doable. But the discretization is a non-trivial issue; that is clear in this case. I tried to convince you that it is difficult, but not so difficult.
But we have the feeling that, going from the continuous to the discrete case, a lot of people try many recipes. Are we going to find a general framework for it? Every time I go to conferences, each person tries a different recipe: one says I am going to use the Euler discretization, another uses a different type of discretization, and it goes on.
In that case, I think there are different issues. There are, for example, potentials which are steep — that is a different problem, which I did not mention. What happens when you have steep potentials? Then you need to be more careful, because you need a kind of implicit scheme; you cannot use explicit schemes. So it is not the only issue: in our case we have a reasonably simple way to proceed, but there are many, many other problems linked with which discretization to use, and it is very difficult. The other thing which is interesting to me — and this is why I keep advocating it — is that people who use machine learning have already implemented gradient descent, because gradient descent is the workhorse of any machine learning technique right now, as I mentioned. And the good thing about our solution is that you simply have to change one line of code. If you look at our discretization, you do not have to implement an accept-reject step or anything else: you take your favorite gradient descent algorithm and you add some amount of noise. The only other thing I advocate is that you also need to tune the step size in a slightly different way. So when I go to the machine learning community, it is so simple to modify the code that I can convince people to at least try it.
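A minimal sketch of that one-line change, with a toy quadratic potential standing in for the real objective (everything here is illustrative):

```python
import numpy as np

def grad_U(x):
    # Toy potential U(x) = ||x||^2 / 2, i.e. a standard Gaussian target.
    return x

rng = np.random.default_rng(0)
x = np.zeros(5)
gamma = 1e-2
for _ in range(1000):
    # Plain gradient descent would be:  x = x - gamma * grad_U(x)
    # The Langevin sampler is the same line plus Gaussian noise:
    x = x - gamma * grad_U(x) + np.sqrt(2 * gamma) * rng.standard_normal(x.shape)
```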
And it becomes especially interesting now. If you look at the machine learning literature of the last 20 years, when people started to look at large-dimensional problems there was a trend towards convex problems, starting with the work by Vapnik, who was the first to popularize the support vector machine. Then there was the lasso and things like this, and a constant stream of literature saying: if we go to large dimensions, we should have simple models which are convex; then there is a single optimum, and a lot is known about how to perform convex optimization, even in very large dimensions. And it was successful — until about five years ago, when deep learning reappeared, neural nets reappeared, and you are really in a non-convex case. And in fact,
there is something good with this type of approach: when you take a stochastic gradient descent algorithm and you add some noise, you get a very large benefit, because a gradient descent algorithm is typically stuck in the first local minimum it finds, whereas when you start to use some amount of noise, you also escape from traps. So it is more like a global optimization technique. And it is funny — well, not so funny, but funny — that in the first contests run in the machine learning community with neural nets (not deep nets yet, but neural nets with architectures not so far from today's deep nets), the first technique that won was based on this type of idea. It was based on MCMC, and I do not think it succeeded really because it was Bayesian; I truly believe it was successful because the technique at least escaped from the obvious local minima. And I think there are other benefits too, so I do not want to overclaim. So I see two main points in my presentation. The first: if you are running any algorithm with a regularization term, you are Bayesian — that is a good message. The second: if you start to look at non-convex optimization problems, adding some noise makes them extremely robust, because when you add some noise you will at least escape. Of course, I do not guarantee that you will reach the global optimum, but you will escape from the annoying local minima. I think this is why, if you look at the literature now, more and more techniques are being proposed which are at least in this spirit, and we contribute to that.
Other questions? Yeah, I have one from an applied perspective. I like this image reconstruction algorithm you showed: does it always have to work on the complete image, or can it be parallelized, for instance by working on blocks? Yeah, I did not do that, but yes, it can be. There is what is called consensus Monte Carlo and similar techniques, so techniques for that exist; I did not work on it. So it will not break the...? No, I did not look at this, but there is a lot of literature on sample splitting and things like this. I did not do it, but yes, it is doable.
Yeah, another question. I will not go into the Bayesian-versus-orthodox debate, but one feeling we have concerns the choice of the prior you are making here — you used a Laplace prior, if I remember. What was the impact of the choice of the prior? Well, it is not exactly that point, because when you do classical machine learning you have two terms: a likelihood term — we all start with a likelihood — and a regularization term. So everyone in machine learning is Bayesian; there is no orthodox machine learner, because as soon as you use a regularization term, you are Bayesian. The main difference is elsewhere. When you are not Bayesian, you say: I have a function which is the sum of the likelihood and the regularization term, and I compute only its maximum.
When you are Bayesian, you say: okay, I sample the posterior distribution. And when I sample the posterior distribution in this kind of dimension, what I show is that there are many, many solutions which are valid, which are extremely close to the maximum. The funny thing — intuitively everybody knows it, of course — is that you start with a problem which is badly conditioned. The regularization gives you one solution, of course, but that solution becomes extremely sensitive to the regularization. If you just report one point, you do not see that there are many, many other points which are very close in terms of the criterion: the solutions themselves can be very different, but the values of the criterion will be extremely close. When you report a single point, like the maximum, you do not see that the objective can be extremely flat — and in large dimension it typically is extremely flat — meaning that many, many solutions achieve almost the same value of the objective. You can see this in the machine learning literature, where there is a lot of work on tuning the amount of regularization by cross-validation and the like: when you start to play with this, especially with L1 lasso-type techniques, you obtain many, many different solutions, because a small perturbation of the regularization triggers the appearance of a completely different solution. Then what guides the intuition behind the regularization term?
When you have a problem which is ill-conditioned, you need prior information in order to solve it, otherwise... Of course, but how do you choose it? You said: because it is sparse, I am going to take... Yeah, that is typically what people do. For sparsity, they say: I have many regressors, but my assumption is that only a small subset of the regressors is meaningful, so you will use an L1 or, more generally, a sparsity-inducing prior — L1 is one of many, many possible priors you can use. L1 is a bit old-fashioned now, but you will use a prior which tries to promote sparse solutions. In inverse problems you always have either smoothness assumptions — in PDEs it is more smoothness — or sparsity-inducing assumptions. So it is not about being orthodox versus Bayesian. No, you cannot do orthodox machine learning — there is no such machine learning. As soon as you go to machine learning you are, as you mentioned, non-parametric, and when you are non-parametric,
you need to have assumptions. Yes, a question there. So if L1 is old-fashioned for sparsity, what is...? SCAD, or one of many other types of penalty. Lasso is old-fashioned or not? Lasso is a bit old-fashioned, yes. There are many other penalties, because there is a problem with the lasso, and it is easy to understand: it biases the solution. In order to favor sparsity you have to increase the regularization term, and when you increase the regularization term, the L1 penalty shrinks even the large components: it contracts the true value of theta. So it biases the solution, and very significantly. What people typically want is a penalty that eliminates the small values — shrinks them to zero — but does not shrink the large values: when theta is very large, you do not want to penalize it. The SCAD penalty does exactly that. And of course, the right penalty is L0.
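For concreteness — this is the standard Fan-Li formulation, added here, not something shown on the slide — the SCAD penalty is usually specified through its derivative:

$$
p'_{\lambda}(\theta) \;=\; \lambda\left\{ \mathbf{1}(\theta \le \lambda) \;+\; \frac{(a\lambda-\theta)_{+}}{(a-1)\lambda}\,\mathbf{1}(\theta > \lambda) \right\},
\qquad \theta > 0,\; a > 2,
$$

so it behaves like the lasso near zero but applies no shrinkage at all once θ exceeds aλ (a = 3.7 is the usual default).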
So if you insist on being convex, L1 is the last stop before non-convexity. But if you are ready for a non-convex experience, then you can go much closer to L0. And yeah, that's it.
Other questions? Okay, I have just a small comment. You should know that, for some reason, analog is coming back everywhere — even in our disciplines, frequencies are getting higher. Is that possible here? In that case it would be easier: you simply have to compute gradients, and sometimes it is not that complex. No, no, that is a good idea — and there are many sources of Brownian motion, it is true. Okay, I think we will take the break right now; there is lunch. So let's thank the speaker.