Multiscale Models for Image Classification and Physics with Deep Networks
Formal Metadata
Number of Parts: 5
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/53799
Transcript (English, auto-generated)
00:15
Thanks. In fact, there is one key word which is missing here, which is mathematics, and
00:22
the goal of the talk will be to try to show that there is a lot of very interesting mathematics which is emerging nowadays from all these problems, in particular neural networks, and their links with physics, and many beautiful, largely not understood problems
00:42
are coming out of these domains. So that's what I would like to try to show. And I'll begin with a very general question, which is fundamental in this field: basically, when you're dealing with problems such as image classification, or
01:04
when you're trying to understand properties of large-scale physical systems, it's about trying to understand a function f of x in very high dimension, where x is a vector in high dimension d. Think, for example, of a problem such as image classification: x would be an image, and to that image you want to associate, let's say, its class,
01:25
so you have such a function f of x, and x is really a very complex function or high-dimensional vector, such as these images. If you think of the same problem in physics, then x will describe the state of the system. For example, in quantum chemistry,
01:44
you would describe the geometry of a molecule, and what you'd like is to compute, for example, the energy. If you have access to the energy, then you have access to the forces by computing the derivatives, so basically the physical properties of the system.
02:00
Can you learn physics by trying to approximate such a function, given some examples, a limited number of examples? A different class of problems arises in the modelization of data. In that case, what you want to model, what you want to approximate, is a probability distribution.
02:23
Again, there are very well-known, very difficult problems in physics. For example, turbulence. Since the first papers in the 1940s of Kolmogorov, that has been a central problem in physics. Try to understand how to define the probability distribution describing a turbulent
02:41
fluid with a high Reynolds number. But you can even think of much more complex problems, such as faces: can you describe a random process whose realizations would be faces? In that case, you have something which is totally non-stationary, totally non-ergodic. Now, the reason why we can ask these questions nowadays is because of these deep neural networks, which seem to be able
03:05
to do such a thing. Let me remind you what these deep networks are. I'll take the example of an image. You want to do image classification, so x is the image that you input into the network. What such a network, in particular a convolutional neural network,
03:26
basically computes is a cascade of convolutions. You have a convolution operator which has a very small support, typically a 3x3 or 5x5 kernel. The output is going to be transformed by a non-linearity, a rectifier, which basically keeps the coefficients
03:44
whenever they are above zero and puts to zero any coefficient which is negative. You'll do that with several filters which will produce a series of images here. This is the first layer of the network. Then, on each of these images, you will again apply a filter. You are going to sum up
04:02
the result and produce an output which is one of these images in the second layer. And you are going to do that for different families of filters, which will produce a whole series of images that you subsample in the next layer. And then you repeat. Each time you define a family of filters, do your convolutions, sum up, apply the non-linearity and go to the next layer
04:26
until the final layer. And in the final layer, you just simply do a linear combination and you get, hopefully, an approximation of the function f of x that you would like to approximate. Now, how do you train this network which has typically hundreds of millions of parameters which
04:47
are the parameters of all your convolutional kernels? You update them in order to get the best possible approximation of the true function f of x on the training data, and for that you have an optimization algorithm.
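To make this concrete, here is a minimal NumPy sketch of the forward pass just described: small-support convolutions, a rectifier, subsampling, and a final linear combination. All sizes and filter values below are illustrative assumptions (in particular, one shared kernel per output channel), not an actual architecture from the talk; the kernels and the final weights are what the optimization adjusts on the training data.

    # Minimal sketch (illustrative only) of the cascade described above:
    # small-support convolutions, a rectifier, subsampling by 2, repeated,
    # then a final linear combination of the remaining coefficients.
    import numpy as np

    def conv2d_valid(x, w):
        """Plain 2D 'valid' correlation of a single-channel image x with kernel w."""
        H, W = x.shape
        k = w.shape[0]
        out = np.zeros((H - k + 1, W - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
        return out

    def relu(v):
        # the rectifier: keep positive coefficients, set negative ones to zero
        return np.maximum(v, 0.0)

    def features(x, layers):
        """Cascade: for each layer, convolve the input channels with each filter
        (for simplicity, one shared kernel per output channel), sum over input
        channels, rectify, and subsample by 2."""
        channels = [x]
        for filters in layers:
            channels = [relu(sum(conv2d_valid(c, w) for c in channels))[::2, ::2]
                        for w in filters]
        return np.concatenate([c.ravel() for c in channels])

    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32))                      # a toy "image"
    layers = [[rng.standard_normal((3, 3)) for _ in range(4)] for _ in range(2)]
    phi = features(x, layers)
    w_out = rng.standard_normal(phi.size)                  # final linear layer
    print("approximation of f(x):", float(phi @ w_out))

Training would then amount to adjusting the kernels in "layers" and the weights "w_out" by stochastic gradient descent so that the output matches f(x) on the training examples.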
05:08
Now, what has been extremely surprising since, let's say, 2010 onwards is the fact that these kinds of techniques, these kinds of machines, have remarkable approximation capabilities on a very wide range of applications. Everybody has heard, of course, of
05:22
image classification, but it goes much beyond that. Sound and speech recognition, what you have in your telephones now, are based on such techniques. Translation and analysis of text are done with such systems. Regression in physics, computations in quantum chemistry, signal and image generation.
05:45
Basically, when you begin to have enough data, a very large amount of training data, it looks like these kind of systems are able to get the state of the art and we essentially don't understand why. There is something very mysterious because you have a single type of architecture
06:04
which is able to approximate these very different classes of problems which indicates from a mathematical point of view that these problems share the same kind of regularity so that the same kind of algorithms can approximate that. So one obvious question is to understand what is
06:22
in common between all these problems. What kind of regularity allows that kind of machine to approximate these functions despite the curse of dimensionality, despite the fact that we know that in high dimension, in order to approximate a function, the number of data samples normally explodes exponentially with the dimension? You can take many different points of view to
06:46
analyze that. A lot of work is devoted to the optimization side: how come you can, with a stochastic gradient descent, optimize such systems? That's not the kind of question I'll be asking. Here I'll be trying to understand why that kind of architecture can approximate wide classes of
07:05
functions, and what that says about the underlying regularity which is needed in order to be approximable. Basically, it's going to be a harmonic analysis point of view: I'll analyze these as a kind of harmonic analysis machine. So the questions are: what kind of
07:24
regularity do these functions have in order to be approximable, and why can this kind of computational architecture approximate such functions? What is really done inside the structure? What does the learning provide? So I'll be showing that there are several key properties that come up.
07:46
One is the fact that scales are separated: basically, you build a kind of multi-scale representation. If you think about it, the depth that you have here is a kind of scale axis. Why? Because you first aggregate the information at a very small, very fine scale, over a
08:05
small neighborhood, and then, because you cascade this aggregation and sub-sampling, you progressively aggregate the information over wider and wider scales. Regularity under deformations, under diffeomorphisms, is also at the core of that kind of thing. It's at the core of physics.
08:24
We'll see that it appears also in the case of image classification. Sparsity: when you look at the coefficients in these types of networks, many of them are zero because of the rectifier, which indicates some kind of sparsity property. We'll see that that's
08:40
fundamental in order to understand the kind of regularity coming out. So I'm going to look at three different types of problems. The first one is classification. The second one, and in fact I'll begin with this one because it is slightly simpler but still an extremely difficult problem, is the modelization of
09:04
totally non-Gaussian but ergodic random processes, such as turbulence. Now why? Because people have been doing experiments to try to model turbulent fluids such as the one you see
09:21
here. This is from astrophysics. This is a two-dimensional vorticity field, and these are bubbles, and the images that you see below are syntheses from such networks. So how do they do that? They take an image. They first train the network on a database of images which have
09:40
nothing to do with these particular images. Once the network is trained, typically on the database called ImageNet, you take the image, this one example, and you compute all the coefficients of this image in the network, within the different layers. Now you look at one layer, and you compute the correlations of the coefficients within that layer at different channel positions.
10:08
So you get a correlation coefficient. You do that for any pair of channels. Then you have a statistical description of your random processes through these correlations and now you generate
10:21
a new realization of your random process by beginning from white noise and modifying it up to the point where it has the same correlation properties; then you look at the realization and it looks like these things. It looks like realizations of turbulence, bubbles and so on. So the question is: why? What is happening here? It's reproducing something that
10:50
looks like a turbulent fluid. That was an astrophysical image; it reproduced an image which is totally different but looks like a turbulent fluid. This was computed from this one, and for any input
11:05
white noise, when you do the optimization, you get a different image. So you get a random process which seems to have similar properties. Question: why? What happened? What kind of model did you build? There are other types of modelization of random processes, for totally non-ergodic processes.
11:23
What do they do? They take an image, they use such a network in order to have an output which looks like a Gaussian white noise. Then they invert the network so that from this Gaussian white noise they recover something that looks like the original image. That's called auto
11:42
encoders. So they train that, let's say, on faces or bedrooms, and then they synthesize new images by creating a new Gaussian white noise and applying this decoder. Now, if you train that on bedrooms, for example you have a database of bedrooms, you put in a new white noise and you get
12:04
a new bedroom, and with another white noise you get another bedroom. Now, the most surprising thing: you then do a linear interpolation between these two white noises, just a linear interpolation, and any of these interpolated points you plug in here, you reconstruct,
12:24
and what you get is a kind of new bedroom and at any stage it looks like a bedroom. What does that mean? That means that in that space you have a kind of representation of bedrooms which is now a totally flat manifold because the average of two bedrooms is a bedroom. So you
12:41
have completely flattened out your points representing bedrooms, which form a wild set of points in the original space, into something flat. Okay, what's happening? Why? So these will be the questions that organize the talk. The first one will be about the modelization of
13:03
random processes and these ideas of scale separations. What I'll be showing is that scale separation is at the core of the ability to reduce this curse of dimensionality and one of the very difficult problems in this field in math is to understand interactions across
13:21
scales, and I'll be trying to show why non-linearities in these systems provide these interactions across scales. Then the second topic will be in relation to regularity under the action of diffeomorphisms. There we'll look at problems such as classification, regression of
13:40
energies in quantum chemistry, and we'll see the kind of role that this has and, again, what kind of math comes out. The last one will be about the modelization of these random processes such as these bedrooms. What we'll see is that, in some sense, these networks build a kind of memory, and the notion of sparsity is something that will be important. So these will
14:05
be the three stages of the talk. So let me begin with scales. So why is scale separation so important? This is very well known in physics. When you have an n-body problem, so a priori long-range
14:21
interactions, all the bodies are interacting. You can think of your bodies as being particles, but they could be pixels, they could be agents in a social network. How can you summarize, reduce, the interactions, let's say with a central particle here? Where are the very strong interactions? With the neighbors: your family, the neighboring particles or pixels. Then, with
14:45
particles that are farther away, what you can do, instead of looking at the interaction of each particle with this one, is aggregate these interactions, construct the equivalent field, and look at the interaction of the group with this particle. With particles even farther away,
15:05
and these are called multipole methods, you can regroup even larger sets of particles and summarize the interaction with a single term. For example, think of a social network: we are six billion inhabitants on the earth, and you cannot neglect
15:26
even people who are very far away, let's say someone living somewhere in China, because if you neglect each individual in China, then you neglect China, and if China has, let's say, some political tension with France or your country, that can have an influence on your life.
15:42
You don't need to look at each individual, but at the aggregate, China. This idea of multi-scale aggregation allows you to reduce the interactions to about log d components. What is very difficult is to understand the interactions of the groups, in other words,
16:03
the interactions across scales. What has been well understood for a long time is how to do scale separation, and wavelets are the math tool to do that. What has essentially not been understood since the 1970s is how to model and capture scale interactions, and what I'll be
16:22
trying to show is that this is completely central within these networks. Okay, so how do you build scale separation? You do it by introducing small waves, wavelets, which basically look like a Gaussian modulated by a cosine or a sine wave.
16:45
What you do is scale these wavelets like that, and in two dimensions you rotate them, so you get a wavelet for every angle and dilation. Then you take your data and you explode it along the different scales and rotations by doing convolutions like
17:04
in these networks. What does it look like? In the Fourier domain, a convolution is a product. Basically, you filter your Fourier transform into a channel like that, and when you change the angle of the wavelet, you rotate the Fourier support.
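As a rough illustration of this filter bank, here is a small NumPy sketch of a Morlet-like wavelet (a Gaussian envelope modulated by a plane wave), rotated and dilated, and applied by a product in the Fourier domain. The parameter values are my own illustrative choices, not the ones on the slides.

    # Sketch of a 2D wavelet filter bank: rotations and dilations of a
    # Morlet-like wavelet, applied by convolution (a product in Fourier).
    import numpy as np

    def morlet_2d(N, scale, angle, xi=3.0 / 4 * np.pi):
        """Complex Morlet-like wavelet on an N x N grid (admissibility correction omitted)."""
        x, y = np.meshgrid(np.arange(N) - N // 2, np.arange(N) - N // 2, indexing="ij")
        xr = (np.cos(angle) * x + np.sin(angle) * y) / scale   # rotate, then dilate
        yr = (-np.sin(angle) * x + np.cos(angle) * y) / scale
        envelope = np.exp(-(xr ** 2 + yr ** 2) / 2.0)
        wave = np.exp(1j * xi * xr)
        return envelope * wave / scale ** 2

    def wavelet_transform(img, scales=(1, 2, 4, 8), n_angles=4):
        """Convolve img with every rotated/dilated wavelet via the FFT."""
        N = img.shape[0]
        F = np.fft.fft2(img)
        out = {}
        for s in scales:
            for a in range(n_angles):
                psi = morlet_2d(N, s, np.pi * a / n_angles)
                Fpsi = np.fft.fft2(np.fft.ifftshift(psi))
                out[(s, a)] = np.fft.ifft2(F * Fpsi)
        return out

    rng = np.random.default_rng(0)
    img = rng.standard_normal((64, 64))
    coeffs = wavelet_transform(img)
    print(len(coeffs), coeffs[(2, 1)].shape)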
17:20
When you dilate the wavelet, you dilate the Fourier support. So, seen in the Fourier domain, you explode the information into different frequency channels. Now, if you want to model a random process through, let's say, correlations, what you will observe is that the wavelet coefficients at two different scales or angles
17:41
are not correlated if the random process is stationary. Why? Because they live in two different frequency channels, and a simple calculation shows that, because the supports of the wavelets in Fourier are separated, the correlation is zero. Okay, let me look at an example. This is an image, and these are the wavelet coefficients
18:03
at the first scale. Basically, gray is zero, white positive, black negative. You have large coefficients near the edges. This is the average. Then you compute the next scale wavelet coefficients, next scale. What you see is that most coefficients are very small,
18:21
nearly zero, but they look very much alike across scales; yet they are not correlated. So you are unable to capture the dependence across scales with a simple linear operator such as a correlation.
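This statement can be checked numerically; here is a hedged 1D sketch, where the choice of signal (sparse jumps) and of crude band-pass "wavelets" is an illustrative assumption: the raw coefficients at two well-separated scales are essentially uncorrelated, while their moduli are clearly dependent.

    # Raw wavelet coefficients across scales: uncorrelated, yet dependent.
    import numpy as np

    rng = np.random.default_rng(1)
    N = 2 ** 14
    jumps = rng.standard_normal(N) * (rng.random(N) < 0.005)  # sparse random jumps
    x = np.cumsum(jumps)
    x -= x.mean()

    def band_pass(x, j):
        """Crude complex 'wavelet' coefficients: keep one octave of positive frequencies."""
        F = np.fft.fft(x)
        n = len(x)
        band = np.zeros(n, dtype=complex)
        lo, hi = n // 2 ** (j + 1), n // 2 ** j
        band[lo:hi] = F[lo:hi]
        return np.fft.ifft(band)

    w1, w2 = band_pass(x, 3), band_pass(x, 5)     # scales two octaves apart

    def corr(a, b):
        a = a - a.mean(); b = b - b.mean()
        return np.abs(np.mean(a * np.conj(b))) / (a.std() * b.std())

    print("raw correlation     :", corr(w1, w2))                  # close to zero
    print("modulus correlation :", corr(np.abs(w1), np.abs(w2)))  # clearly positive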
18:43
Okay, in statistical physics, how are you going to model a random process? The standard way is to compute moments: you take some functions of the field, you compute their expected values, and then you define a probability distribution
19:05
which satisfies these moments and which has maximum entropy, which is a way of expressing that you have no additional information. You look at all possible configurations having those moments, and what you can rather easily show is that you then get a Gibbs distribution, and this
19:23
Gibbs distribution is defined by Lagrange multipliers, which are adjusted in order to satisfy these moments.
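Written out (my notation, not the slides'), the maximum entropy model under these moment constraints is the following Gibbs distribution:

    % Maximum entropy model under moment constraints (sketch, my notation):
    % impose E_p[\Phi_k(x)] = \mu_k for k = 1, ..., K; the entropy-maximizing
    % distribution is a Gibbs distribution with Lagrange multipliers \theta_k.
    \[
      p_\theta(x) \;=\; \frac{1}{Z(\theta)}\,
        \exp\!\Big(-\sum_{k=1}^{K} \theta_k\, \Phi_k(x)\Big),
      \qquad
      Z(\theta) \;=\; \int \exp\!\Big(-\sum_{k=1}^{K} \theta_k\, \Phi_k(x)\Big)\, dx,
    \]
    \[
      \text{with } \theta \text{ adjusted so that } \;
      \mathbb{E}_{p_\theta}\!\big[\Phi_k(x)\big] \;=\; \mu_k ,
      \quad k = 1, \dots, K .
    \]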
19:45
Now, what people have mostly been doing until now is to compute moments which are basically correlation moments, and that's exactly what the Kolmogorov model of turbulence is about. If you do that, then what you get here is a bilinear function, and therefore you get a Gaussian distribution. So if you look at a Gaussian model of turbulence, that's what you get: the images which are below.
20:02
The images below have exactly the same moments as the ones above, so they are the maximum entropy model constrained by second-order moments, the same spectrum, but you've lost all the geometry of the structures. So, what have people been trying to do in statistics? Go to higher-order moments;
20:23
but if you go to higher-order moments, you have many moments, they have a huge variance, and the estimators are in fact very bad because of that variance. Deep networks seem to give estimators which look much better. Why? The key point here
20:42
will be the non-linearity. What I want to show here is that the non-linearity is what builds the relation across scales, and the key way you are going to relate scales is through phase. This is what is the link between scales. If you take a wavelet which
21:01
has a certain phase alpha, I'll call it that way, I'm going to build a network where I'm going to impose that the filters are wavelets. Okay, so I take my x, I filter it with a wavelet and I apply a rectifier. What's happening if you do that? Let's look at this convolution.
21:24
I convolve it with a wavelet which has a certain phase, I can get out the modulus of the convolution and I'm going to have a cosine which depends upon the phase of the wavelet and the phase of the convolution. Now, what's happening when you put a rectifier?
21:41
The rectifier is a homogeneous operator, so the modulus you can pull out; the rectifier only transforms the phase, by essentially killing the negative coefficients. You can view the rectifier as a window on the phase: it eliminates all the phases which correspond to negative
22:01
coefficients and keeps the phases corresponding to positive coefficients. Now, what if you do a Fourier transform with respect to this phase variable alpha? What you see appearing is that, after applying your rectifier, if I take my coefficient and do a Fourier transform
22:21
with respect to the phase variable, I see the modulus of the output of the filtering appear, and I see the phase appear, but each phase is multiplied by a harmonic index k. So you do something very non-linear: you create all kinds of harmonics
22:41
of the phase. Now, why is that fundamental if you want to model random processes? You see, if you take, and I write it that way, a convolution, and you raise the complex phase factor to the power k, what you essentially do is move
23:01
the Fourier support. So, suppose I am in one dimension, I have a random process, and I look at its components on two different frequency intervals, because I have two wavelets which live over different frequencies. These two components are not correlated, because the Fourier components don't interact. If you apply a harmonic, this frequency is going to move:
23:26
k equals two moves it here, to two lambda, k equals three here. Now, if you look at these two components, they are correlated. So, after applying your non-linearity, you create correlations, because you have moved the Fourier supports. What does that mean?
23:47
That means that if you look over your domain where you've separated all the phases, all the directions, after applying the rectifier, you can view it in that domain or in the Fourier domain, which amounts to computing the harmonics. All these blobs are going to be correlated.
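One compact way to write the construction just described, as a hedged sketch in my own notation rather than the exact formulas of the slides:

    % Phase harmonics (sketch). For a complex wavelet coefficient
    % z = (x * \psi_\lambda)(u) = |z| e^{i\varphi}, the rectified coefficient at
    % phase shift \alpha is a windowed function of the phase:
    \[
      \rho_\alpha(z) \;=\; \mathrm{ReLU}\big(\mathrm{Re}(e^{-i\alpha} z)\big)
      \;=\; |z|\,\mathrm{ReLU}\big(\cos(\varphi - \alpha)\big).
    \]
    % Its Fourier series over \alpha involves the phase harmonics
    % [z]^k = |z| e^{ik\varphi}, with fixed coefficients c_k:
    \[
      \rho_\alpha(z) \;=\; \sum_{k \in \mathbb{Z}} c_k\, [z]^k\, e^{-ik\alpha},
      \qquad [z]^k \;=\; |z|\, e^{i k \varphi}.
    \]
    % Multiplying the phase by k transports the frequency support of
    % x * \psi_\lambda to roughly k\lambda, so harmonics of coefficients at two
    % different scales can overlap in frequency, and their covariance
    \[
      C(\lambda, k, \lambda', k') \;=\;
      \mathrm{Cov}\big([\,x \star \psi_\lambda\,]^k ,\; [\,x \star \psi_{\lambda'}\,]^{k'}\big)
    \]
    % is no longer zero; these are the non-linear correlations used below as
    % moments in the maximum entropy models.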
24:05
You can correlate the coefficient within a given scale by just computing a standard correlation. You can compute a correlation across two orientations by using the appropriate exponent which in that case is k equals zero, it's the modulus, and you can compute the correlations
24:24
across two different bands. And if you look at that, this is very close to the calculations of the renormalization group. The renormalization group is what allows you to compute, in the particular case of the Ising model, what kind of random process you are going to have
24:41
and how you do it: you do it by looking at the interactions of the different scales. Numerically, what do you get? These are examples of random processes. I'm now going to compute a maximum entropy process, not conditioned on correlations but conditioned on these non-linear harmonic correlations, these ones. And that's what you get. In the case of
25:06
Ising at the critical point, you can reproduce realizations of Ising. For turbulence, you produce realizations of random processes which are here. Contrary to the Gaussian case, you now see that the geometrical structures appear, because you've restored
25:24
the alignment of the phases. And now one of the very beautiful questions is: can we extend the calculations of the renormalization group, which we know how to do for Ising, to much more complex processes such as, for example, turbulence, in order to understand better the kind of
25:42
properties these random processes have. That's work that is being done with people at ENS, in astrophysics in particular. Okay, let me now move to the second problem. The second problem is about classification. So you want to classify, for example, digits. One of
26:02
the properties that you see in classification is when your digit moves it's deformed. If the deformation is not too big, typically it will belong to the same class. A three stays a three, a five stays a five. And if you take let's say paintings as long as you move on your
26:22
diffeomorphism group, as long as the diffeomorphism is not too big, you'll basically recognize the same painting; beyond that, it becomes another painting. And if you move like that on the diffeomorphism group, you can go across essentially all the European paintings that you may find in the Louvre in particular. Okay, so the diffeomorphism group is a key element of
26:46
regularity. If you want to approximate a function which is regular to diffeomorphisms, you want to build descriptors which are regular to the action of diffeomorphisms. How can you do that? x is a function in L2. If you deform it, the distance to x is going to be very
27:06
large. How do you build regularity to diffeomorphisms? A very simple way is just to average x. You average it, let's say with a Gaussian, and you get a descriptor which becomes very regular to the action of diffeomorphisms, as long as the deformation is not too big
27:23
relative to the averaging scale two to the j. The problem is that if you do that, you lose information, because you've been averaging; so how can you recover the information which was lost? The information which was lost is the high frequencies, which you can capture with wavelets.
27:43
But if you then average, you are going to get zero, because these are oscillating functions. How can you get a non-zero coefficient? Apply the rectifier, which makes everything positive, and the averages will be coefficients which are again regular to the action of diffeomorphisms.
28:01
But these wavelet coefficients you averaged them so you lost information by doing the averaging. How can you recover the information that you've lost? Well you take these coefficients and you extract their high frequencies. How can you do that? Again with wavelets. And why are
28:21
wavelets very natural here? Because if you want to be regular what is a diffeomorphism? Diffeomorphism is a local deformation, a local dilation. If you want to be regular to actions of diffeomorphism you have to separate scales. And that's what the wavelets will do and you get a new set of coefficients and an averaging. So that's going to look like a convolutional
28:43
network where you iterate convolution with wavelets, non-linearity, convolution with wavelets. But in this network I don't learn the filters. I impose the filters because I have a prior on my knowledge of the kind of regularity I want to produce. So one thing that you can
29:06
prove is that if you build such a cascade, then when you take your function x and deform it, you get a representation which is Lipschitz continuous to the action of diffeomorphisms. In what sense? If x is deformed, and you look at these coefficients, collected as a
29:26
vector, if you look at the Euclidean distance between the representations, let's say the output of your network before and after the deformation, the distance is going to be of the order of the size of the deformation, where the size of a diffeomorphism, in the weak topology, is
29:45
defined by the size of the Jacobian of the deformation and of the translation field, which depends upon space; and you can prove that you have something which is stable. So building something which is regular to deformations naturally leads you to scale
30:04
separation again, and to the use of these non-linearities. The question is: how good will that be compared to deep networks? You have a kind of network, but you haven't learned the filters. How good is that going to be compared to a network where you learn everything?
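Here is a hedged 1D sketch of such a cascade (wavelet convolution, modulus, averaging, iterated twice), of the kind sometimes called a scattering transform. The crude FFT band-pass filters below are illustrative stand-ins for the smoother wavelets required by the stability result mentioned above.

    # Sketch of a two-layer wavelet-modulus-average cascade in 1D.
    import numpy as np

    def wavelet_fft(N, j):
        """Fourier transform of a crude analytic band-pass filter for octave j."""
        h = np.zeros(N, dtype=complex)
        lo, hi = N // 2 ** (j + 1), N // 2 ** j
        h[lo:hi] = 1.0
        return h

    def lowpass_avg(y, J):
        """Local averaging at scale 2**J, then subsampling: the invariant-producing step."""
        width = 2 ** J
        kernel = np.ones(width) / width
        return np.convolve(y, kernel, mode="same")[::width]

    def scattering(x, J=6):
        """Order-0/1/2 coefficients: averages of x, of |x*psi|, and of ||x*psi|*psi'|."""
        N = len(x)
        F = np.fft.fft(x)
        S = [lowpass_avg(x, J)]                                # order 0
        for j1 in range(1, J):
            u1 = np.abs(np.fft.ifft(F * wavelet_fft(N, j1)))   # |x * psi_j1|
            S.append(lowpass_avg(u1, J))                       # order 1
            F1 = np.fft.fft(u1)
            for j2 in range(j1 + 1, J):                        # only coarser second scales
                u2 = np.abs(np.fft.ifft(F1 * wavelet_fft(N, j2)))
                S.append(lowpass_avg(u2, J))                   # order 2
        return np.concatenate(S)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(2 ** 10)
    print(scattering(x).shape)

With smoother, properly designed wavelets, one can show that translating or slightly deforming x moves these averaged coefficients by an amount of the order of the deformation size, which is the Lipschitz property stated above.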
30:23
The first problem I'm going to look at here is quantum chemistry. Now, quantum chemistry is an interesting example because you have prior information on the type of function you want to approximate. What do you know? The problem is the following: x, the state of the system, is described by a set of atom positions and charges, and for the energy of, let's say, a molecule,
30:50
you know that if you translate the molecule it's not going to change. If you rotate the molecule it's not going to change so it's invariant to translation and rotation.
31:00
If you slightly deform the molecule the energy is just going to change slightly so you have a regularity to the action of diffeomorphism. Question, can I learn such a function just from a database which gives me configuration of molecules and the value of the energy? Now
31:22
if you look at quantum chemistry, the way such energies are calculated is by using what is called DFT, density functional theory. The key idea is the following: you take a molecule and you look at the electronic density of the molecule; that's what I'm showing here. Each gray level gives you the probability
31:44
of occurrence of an electron at a given position, and the electrons are very close to the atoms, but they are also in between two atoms, because that's the chemical bond which is there. To compute such a thing requires solving the Schrodinger equation. In this framework,
32:04
I suppose I don't know any physics beyond these basic invariances, so I don't know the Schrodinger equation. What we're going to do, and there is now a whole community in physics and machine learning doing that kind of thing, is represent the molecule just by its state:
32:24
the only thing I know in x is the position of each atom and the number of electrons on each atom, so I'm going to represent the electronic density naively, as if each electron were sitting exactly where the core, the nucleus, of the atom is. So you get a kind of naive electronic
32:47
density like that; here I have no idea what chemistry is about. Then you build a learning system: you take your density in 3D and you compute a representation by separating all scales,
33:03
separating all angles, applying rectifiers, and you get these kinds of 3D blobs which look a little bit like orbitals. Then you apply your linear averaging, which builds a number of descriptors of the order of log of the dimension, which are invariant to translations and
33:24
rotations, and stable to deformations. And where do you learn physics? In the last stage, where you just learn the weights of the linear combination of all these descriptors to try to approximate the true energy of the molecule. How do you learn these coefficients? Because you
33:47
have a database of examples, and you regress your coefficients on this database.
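A hedged sketch of that regression pipeline, with everything synthetic: Gaussian bumps placed at the atom positions and weighted by the charges stand in for the naive electronic density, a radially averaged Fourier modulus stands in for the wavelet-based translation/rotation-invariant descriptors of the talk, and the energies are fake. Only the structure of the pipeline (density, invariant descriptors, linear regression of the weights) is the point.

    # Naive density -> invariant multiscale descriptors -> ridge regression of energy.
    import numpy as np

    def naive_density(positions, charges, N=32, box=10.0):
        """Sum of Gaussian bumps at the atom positions, weighted by charge (3D grid)."""
        grid = np.linspace(0, box, N)
        X, Y, Z = np.meshgrid(grid, grid, grid, indexing="ij")
        rho = np.zeros((N, N, N))
        for p, q in zip(positions, charges):
            rho += q * np.exp(-((X - p[0]) ** 2 + (Y - p[1]) ** 2 + (Z - p[2]) ** 2) / 0.5)
        return rho

    def invariant_descriptors(rho, n_shells=8):
        """Translation/rotation-invariant features: |FFT| averaged over spherical shells."""
        F = np.abs(np.fft.fftn(rho))             # modulus removes translation
        N = rho.shape[0]
        k = np.fft.fftfreq(N)
        KX, KY, KZ = np.meshgrid(k, k, k, indexing="ij")
        r = np.sqrt(KX ** 2 + KY ** 2 + KZ ** 2)
        edges = np.linspace(0, r.max() + 1e-9, n_shells + 1)
        return np.array([F[(r >= a) & (r < b)].mean() for a, b in zip(edges[:-1], edges[1:])])

    rng = np.random.default_rng(0)
    molecules, energies = [], []
    for _ in range(60):                          # tiny synthetic "database"
        n_atoms = rng.integers(2, 6)
        pos = rng.uniform(2, 8, size=(n_atoms, 3))
        chg = rng.integers(1, 7, size=n_atoms).astype(float)
        molecules.append(invariant_descriptors(naive_density(pos, chg)))
        energies.append(chg.sum() + 0.1 * rng.standard_normal())   # fake target energy
    Phi, y = np.array(molecules), np.array(energies)

    lam = 1e-3                                    # ridge regression of the linear weights
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
    print("train RMSE:", np.sqrt(np.mean((Phi @ w - y) ** 2)))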
34:02
There are databases which have been constructed to test this kind of thing, of typical size about 130,000 molecules; these are organic molecules. What people do is compute their deep network, where they learn everything, and compare the errors with the errors that a numerical algorithm would make with DFT. And what people have been observing is that, with that kind of technique,
34:22
you can get, if the database is rich enough, an error which is smaller than the error calculated by a typical numerical scheme with DFT. In our case, we don't learn the filters; we just say that we know some regularity properties, so the math leads to a certain type of filters,
34:47
and basically you get an error which is of the same order. So that shows that in this kind of example you basically know what is being learned. What is learned
35:01
is the type of regularity; that is what is learned in the filters, and you can therefore replace them with wavelets, and the only thing that you need to learn is basically the linear weights at the output. However, these are, let's say, simple problems, in the sense that the kind of databases that have been used until now
35:25
are databases of small molecules, about 30 atoms. Yes, sorry? [Inaudible question from the audience.] Yeah, when I was comparing here: these are the deep nets; what I call a deep net is when
35:44
you learn everything. When you learn everything, you get an error of the order of 0.5 kilocalories per mole; when you don't learn anything, you get an error of the same order, if you use wavelets which are adapted to the kind of transformations to which you want to be regular or invariant. In that kind of case, the network is in fact smaller, because
36:13
you know exactly what you want; basically, in that case, you know exactly the kind of filters, so you don't have to build a big structure. But again, these are not horribly complicated
36:26
problems, and if you look at the world of images, you can see the difference between simple and hard problems. What is a simple problem? Recognizing digits: you have an image of a digit, and you have to recognize what kind of digit it is.
36:43
Or differentiating textures, which are uniform random processes. If you take a deep network and you learn everything, or if you impose that the filters are wavelets, you get about the same kind of result. If you then move to something much more complicated,
37:01
such as ImageNet, then if you impose filters which are wavelets, the error is going to be, in that case where you have 1000 classes, about 50%. If you learn the filters, the error is much smaller; in 2012, and that was the big result that began to attract so much
37:23
attention, the error was 20%; now it's about 5%. So the question here is: what is learned in these kinds of networks? And that's the last part. What I would like to show is that there is a simple mathematical model which can capture, to first order, what is learned, which
37:46
is basically the learning of dictionaries to get sparse approximations. This domain used to be called pattern recognition. What is a pattern? A pattern is a structure which
38:01
approximates your signal and which is important for classification. How can you think of decomposing a signal in terms of patterns? x is my data. I'm going to define a dictionary of patterns: each column of my matrix D is a particular pattern. The decomposition of x as a sum of a limited number of patterns can be written as a product of this
38:28
matrix with a sparse vector z which is mostly zero besides the few patterns that you are going to select to represent x. Now how can you express such a problem? This is a well-known field
38:44
called sparse representation, which has been studied since basically the 90s, and one way you can specify the problem is by saying: x is going to be approximated by D multiplied by z,
39:01
and I want z to be sparse, so I impose that the L1 norm of z is small. So I solve this optimization problem, and I also impose that the coefficients of z are positive; then you have a convex minimization problem. Good. So how can you solve such a convex
39:21
minimization problem? There are different types of algorithms to do that; basically, these are iterative algorithms. They amount to doing a gradient step on the squared-norm term, which leads to that kind of matrix, and you have your L1 term, which you minimize by doing a non-linear projection.
39:48
It happens that in this case the non-linear projector is exactly a rectifier. So basically, solving that kind of problem amounts to applying a linear operator, then a rectifier
40:02
with what is called here a bias, where the bias corresponds to the Lagrange multiplier. So to solve that, you essentially compute a deep network: a deep network where at each stage you apply your matrix, which is a linear operator, and a rectifier,
40:24
and you iterate. In this network the matrices are all the same; they only depend upon one matrix, the dictionary.
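Here is a minimal sketch of that iteration (iterative soft thresholding with a positivity constraint); the dictionary and signal below are random, purely for illustration.

    # Solve  min_z  0.5*||x - D z||^2 + lam*||z||_1  with z >= 0
    # by a gradient step on the quadratic term followed by the non-linear
    # projection, which here is exactly a rectifier with a bias.
    import numpy as np

    def relu(v):
        return np.maximum(v, 0.0)

    def positive_ista(x, D, lam=0.05, n_iter=200):
        """Each iteration applies the same linear operator and a rectifier:
        structurally, an unrolled deep network whose layers share one matrix D."""
        step = 1.0 / np.linalg.norm(D, 2) ** 2        # inverse Lipschitz constant
        z = np.zeros(D.shape[1])
        for _ in range(n_iter):
            grad = D.T @ (D @ z - x)                  # gradient of the quadratic term
            z = relu(z - step * grad - step * lam)    # rectifier; bias = Lagrange multiplier
        return z

    rng = np.random.default_rng(0)
    D = rng.standard_normal((64, 256))
    D /= np.linalg.norm(D, axis=0)                    # unit-norm dictionary columns ("patterns")
    z_true = relu(rng.standard_normal(256)) * (rng.random(256) < 0.05)
    x = D @ z_true + 0.01 * rng.standard_normal(64)

    z = positive_ista(x, D)
    print("non-zeros:", np.count_nonzero(z),
          "reconstruction error:", np.linalg.norm(x - D @ z) / np.linalg.norm(x))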
40:43
Now, how can you use that for learning? You can use it by saying: okay, I would like to extract the best patterns, those that will lead to the best classification. So I'm going to build a network where you compute your sparse coding, then I put a classifier, and I get my classification result. And now what do you want to do? You want to compute
41:04
the best dictionary which is going to lead to the smallest error. So you are going to optimize the weight in the dictionary d and the classifier so that over a given database
41:21
of data and labels, the classification error, the loss, is as small as possible. And for that you do a standard gradient descent. So this is standard neural network learning; the only thing is that you are doing something which is, from a math point of view, well understood: you are just doing a sparse approximation where you learn your matrix,
41:49
and you can look at the convergence of that kind of thing. Okay so this is what wavelets are giving you. If you don't learn anything in your network you are going to do
42:04
a cascade with predefined wavelets, compute your invariants, do your classification, and you essentially get 50 percent error. What you could think is: okay, let's replace the wavelet representation with a dictionary. So you learn a dictionary which is optimized
42:24
in order to minimize the error, and you don't improve much; essentially you get the same kind of error. Now, what if you cascade the two? You first compute your representation with your invariants, which are regular to the action of diffeomorphisms, and then in that space you
42:45
learn the dictionary. And there you have a big drop of the error, which goes down to 18 percent, which is essentially better than what was obtained in 2012 with the famous AlexNet. What does that mean? That means that there are really two elements. There is one element which
43:04
is due to the geometry that you know, which is captured by translations, rotations, diffeomorphisms. You want to essentially reduce, eliminate, these variabilities. Once you've eliminated these variabilities, you can define a set of patterns, because otherwise
43:22
you would need to define a different pattern for every deformation, every translation, every rotation, and your dictionary would get absolutely huge. Now, why you have such an error reduction is an open problem. What you observe is that in the output you have a kind of concentration phenomenon that we don't quite understand, but at this stage you can build much simpler models,
43:44
which are basically a cascade of two well-known operators. The last application I'm going to show you briefly is about these autoencoders. These autoencoders are able to
44:05
synthesize random processes which are absolutely not ergodic and I gave this example of bedrooms where you also see these deformation regularity properties.
44:25
One way to pose the problem is the following. You begin with a random process X. What this encoder essentially does is build a Gaussian white noise. So you have found a map which builds a Gaussian white noise. This map should be invertible: I'm going to impose that this
44:45
map is bi-Lipschitz over the support of the random process. The third property is that you want those deformation properties, so you want your map to be Lipschitz continuous to the action of diffeomorphisms. Questions: how to build such a map, and how to invert it?
45:03
So, how to build such a map? We have just constructed something which is regular to the action of diffeomorphisms, by separating the scales and then doing a spatial average. Now, why would that build something which is Gaussian? Because when you begin to
45:25
average over a very large domain, you begin to mix your random variables, and if your random variables have only a bit of correlation, when you average them over a very large domain you have your central limit theorem, which tells you that this is going to converge to Gaussian random
45:43
variables. Then you apply a linear operator which whitens the Gaussian, and you can hope to get a Gaussian white noise. Now, the question is the inversion: is this map going to be bi-Lipschitz? Can you invert it, and how does it relate to the previous topic? So,
46:03
what is hard to invert here? What is hard to invert is not the first non-linear part, because the non-linearity is due to the rectifier, and if you keep the rectifier of a coefficient at two opposite phases, you can get the coefficient back; so that's easy to invert. The wavelet
46:22
transform is invertible so that's not a problem. What is hard to invert is this averaging. The averaging is basically building a Gaussian process by reducing the dimension and mixing all the random variables and that's not an invertible operator. Now how can you invert a linear
46:43
operator which loses information? You can, if you have some prior information about the sparsity of your data; that's called compressed sensing. So how do you do that kind of thing? Same idea: you need to learn a dictionary in which your data is sparse. So the idea is the following.
47:08
Your representation, in some dictionary that you don't know for now, is going to be sparse, so it can be written as a sparse vector multiplied by D. If that's the case, the white noise that you've obtained by applying your operator L
47:23
is itself going to have a sparse representation: because Ux is equal to Dz, in that dictionary, the effective dictionary is now LD. So what does that mean? That means that in order to invert your map,
47:41
what you need is to compute the sparse code: you compute, in this dictionary which sparsifies w, the sparse code, and then, applying your dictionary D and inverting U, you recover x. How do you do that? How do you learn this dictionary? Well, you learn this dictionary
48:05
by essentially taking your examples and, each time, trying to find the dictionary which best reproduces the examples, and that's done by optimizing the dictionary in this neural network, again with a stochastic gradient descent.
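A hedged sketch of this inversion mechanism, with a random dictionary D and a random reduction L in place of the learned ones, purely to illustrate the compressed-sensing step (recover the sparse code of w in the combined dictionary LD, then apply D):

    # Recover x from w = L x, assuming x = D z with z sparse.
    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(w, A, lam=0.02, n_iter=800):
        """Solve min_z 0.5*||w - A z||^2 + lam*||z||_1 by iterative soft-thresholding."""
        step = 1.0 / np.linalg.norm(A, 2) ** 2
        z = np.zeros(A.shape[1])
        for _ in range(n_iter):
            z = soft_threshold(z - step * A.T @ (A @ z - w), step * lam)
        return z

    rng = np.random.default_rng(0)
    n, m, p = 256, 96, 512                        # signal dim, reduced dim, dictionary size
    D = rng.standard_normal((n, p)); D /= np.linalg.norm(D, axis=0)
    L = rng.standard_normal((m, n)) / np.sqrt(m)  # non-invertible reduction (m << n)

    z_true = rng.standard_normal(p) * (rng.random(p) < 0.02)   # sparse code
    x = D @ z_true                                 # the signal (or its representation)
    w = L @ x                                      # what the averaging keeps

    z_hat = lasso_ista(w, L @ D)                   # sparse code recovered from w alone
    x_hat = D @ z_hat                              # approximate inversion of L
    print("relative recovery error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))

For synthesis, one would feed in a fresh Gaussian white noise in place of w and decode it in the same way, which is what produces the new faces and bedrooms shown next.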
48:22
So let me show some examples. These are examples of faces. The top images are the training examples, on which you train your network and optimize the dictionary, and then you reconstruct these images; that's
48:40
what you see here. Then you try with new realizations of your random process: these are the testing images. You decompose, now using the dictionary that was computed from the first images, you try to reconstruct, and indeed the reconstructed images look good. So that means that you have indeed inverted your bi-Lipschitz map.
49:08
You can train your network on a database; that's typically what an autoencoder is doing. These are the training images, and these are the reconstructed ones.
49:22
This is what you get on the testing images; these are the reconstructed ones. Now, what happens if you take two white noises, from each of which you compute an image, and you do a linear interpolation in the noise domain? The noise domain in this case is
49:44
essentially the domain of the scattering coefficients which have been averaged. Because they are again regular to the action of diffeomorphism, when you do a linear interpolation and you reconstruct an image you see how basically you warp progressively one image
50:02
into the other one. If you do that on a different image and another one you see the same kind of thing. If you do that on a bedroom you warp one image into the other image. Now, synthesis. So now you've represented again your random process with your scattering
50:24
coefficients, and you have learned the dictionary which represents your coefficients in a sparse way. Now the synthesis amounts to randomly producing a Gaussian white noise, throwing this random white noise through this generator, and computing an x. So now you have defined
50:47
a random process, and for different realizations of the white noise, this is what it looks like. So you've produced a random generator of faces; that's essentially what autoencoders are doing. What I'm showing here is that you can do it just by learning a single matrix,
51:08
which is this dictionary. If you train it on bedrooms, it doesn't look as good, because bedrooms are much more complicated, but you see that your realizations look a bit like geometric bedrooms;
51:20
at least it doesn't look like faces. What you get here is that essentially you have totally different types of stochastic processes, models, where everything is captured within the dictionary and the random excitation defines the space of your free, let's say, variables or the entropy
51:45
of the random process. You can do other things, and that's important in particular for simulations in physics. You may know that when you compute, for example, in fluid dynamics, you may have a very coarse grid approximation, for example in climatology, and you'd like to
52:03
get a model of what's happening at fine scales. So this is the input, the very coarse scale approximation, but you've learned your dictionary. Now you put in noise, and these are all possible realizations which have the same low frequencies; each realization corresponds to a different
52:21
noise, and corresponds to a different facial expression. So we are in a world which is very unusual from a math point of view, but what I want to say is that there is a lot of math to be done here; it's not just algorithms. Now, if you come back to deep networks and the topic itself,
52:44
essentially these kinds of structures have the complexity of a Turing machine. A Turing machine is programmed by a program; here, the program is essentially these weights, and you have, let's say, hundreds of millions or billions of weights. Now, with a Turing machine,
53:02
if you want to make a code which has no bug, what you do is you structure your program. You don't build a program with millions of lines of code. Currently we are training these machines directly by training the millions of coefficients which makes them extremely complicated. In some sense what I'm showing here is that when you begin to understand the math,
53:23
you can structure these machines, and you see different kinds of functions appear. The first few layers, in this case, essentially correspond to reducing the geometry of the problem. Then you see phenomena appear which are more related to sparsity, and there are probably many, many other phenomena that are absolutely not incorporated here. Now, this being
53:45
said, the math problems remain very widely open. The kinds of questions you'd like to ask are, again: what is the regularity class? What is the set of functions that can be approximated by such networks? What kind of approximation theorems hold? How many neurons do you need to get an epsilon
54:04
error? These questions are totally not understood. Okay well thanks very much.
54:25
Thank you for your talk. When you were showing the example of the reconstructed images, between the training set and the testing set, it seemed to me that the test reconstructed images were much closer to the originals than the training ones. So that means that I've
54:48
inverted the two. That's exactly what it means, because normally it's the contrary. Yes, you are right, and it shows that I inverted the two columns. Thanks.
55:02
No, there is no miracle, which is too bad, but there is no way you can do better. Thank you very much. Wow, I'm very impressed. Yes? Yeah, very interesting. For this last topic, of modeling the network in terms of wavelets and then sparsity, there's a kind of prediction
55:26
for how much information there should be in a network that comes from statistical learning theories. So you can relate the generalization ability or the error rate of the network to the amount of information in the network. It would be very interesting to see if
55:43
the amount of information in your sparsity matches, right, because in some sense your wavelets are universal, so that's not much information; all the information is in this dictionary. So it would be interesting if that were to match up with the prediction. The information bounds that you
56:01
currently get, I mean, if you are referring to the information bounds that people have been computing on networks, they are essentially computed from the norms of the operators and are extremely crude. Yeah, that's right, but there's this target that people wanted to reach, which is to relate it to, to control, the generalization, the
56:25
testing error. Yeah, so you're right, that could be; right now we indeed have to see how to understand the action of this learning of a dictionary. The problem is much simpler than a usual network, because the learning is aggregated in a single matrix, but because it's very non-linear
56:46
and you apply it at each layer, it's indeed not so simple. So yeah, you're right, that's an interesting way to think about it. Yes? It's a question about the various parts of your talk.
57:00
In one part you explained that with correlation-based approximations it is very difficult to capture higher-order correlations and cross-scale phenomena, and at the very end you have this random Gaussian input and a dictionary that produce something with very high correlations. So these are two points of view, but aren't they the same? Okay, so there are two, but the same.
57:25
Okay, let me explain. The reason why you can compute correlations across scales, angles and so on is the non-linearity, which essentially realigns the phase; in some sense you can view it as a realignment of the phase. Now, in these syntheses, what you have is the
57:49
dictionaries which are here, so you can view these coefficients as exciting different patterns, and then the random excitations which are here are essentially selecting randomly
58:07
these patterns; but you can also view it as a realignment of phase. So what is absolutely not there in the second one is the maximum entropy principle, the optimization. What we
58:22
have is much more complicated in the second one, compared to the standard statistical physics framework, because there is really no quantity that you are optimizing; you are not controlling anything. The second one is in fact very close to something which is used all over
58:41
in signal processing, which is autoregressive models. Think of how you build a simple Gaussian approximation: you begin with your random process, you whiten it with an operator, and then you invert this operator to reconstruct your random process; this is what an autoregressive filter does. This is the equivalent of an autoregressive
59:04
filter, but in a totally non-linear way, because you didn't begin with something Gaussian. That's one way to view these autoencoders. How transferable are these methods to learning processes on graphs,
59:25
social networks for example? That I don't know. So, what do you mean, learning the processes on... you mean what? On graphs. Sorry, I didn't understand. Okay, so there are a number of
59:40
groups which have been trying to do that. Basically, what you need is to reintroduce the notions of translation, deformation and scale on a graph. All these notions can be defined; it's just that the translation operator is much more complicated. You can do that by going to the spectral domain: you compute
01:00:00
the spectrum of the graph with a Laplace-Beltrami operator, and from there you can do similar things. So people have been trying to do that. The math is more complex because the translation operator on a graph is much more complicated.