Multiscale Models for Image Classification and Physics with Deep Networks

Video in TIB AV-Portal: Multiscale Models for Image Classification and Physics with Deep Networks

Formal Metadata

Multiscale Models for Image Classification and Physics with Deep Networks
Title of Series
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
Approximating high-dimensional functionals with low-dimensional models is a central issue of machine learning, image processing, physics and mathematics. Deep convolutional networks are able to approximate such functionals over a wide range of applications. This talk shows that these computational architectures take advantage of scale separation, symmetries and sparse representations. We introduce simplified architectures which can be analyzed mathematically. Scale separation is performed with wavelets and scale interactions are captured through phase coherence. We show applications to image classification and generation as well as regression of quantum molecular energies and modeling of turbulent flows.
Thanks. In fact there is one key word which is missing here, which is mathematics, and the goal of the talk will be to try to show that there is a lot of very interesting mathematics emerging nowadays from all these problems, in particular from neural networks but also from their links with physics, and many beautiful, poorly
understood problems are coming out of these domains. So that's what I would like to try to show, and I'll begin with a very general question which is fundamental in this field. Basically, when you are dealing with problems such as image classification, or when you are trying to understand properties of large-scale physical systems, it's about trying to understand a function f(x) in very high dimension, where x is a vector in high dimension d. Think for example of a problem such as image classification: x would be an image, and to an image you want to associate, let's say, its class. So you have such a function f(x), and f is really a very complex function of high-dimensional vectors such as these images.

If you think of the same problem in physics, then x will describe the state of the system. For example, in quantum chemistry x would describe the geometry of a molecule, and what you'd like is to compute, for example, the energy. If you have access to the energy, then you have access to the forces by computing the derivatives, so basically to the physical properties of the system. Can you learn physics by trying to approximate such a function given a limited number of examples?

Different problems come with the modeling of data. In that case what you want to approximate is a probability distribution, and again there are very difficult problems in physics, for example turbulence. Since the first papers of Kolmogorov in the 1940s, that has been a central problem in physics: trying to understand how to define the probability distribution describing a turbulent fluid with high Reynolds number. But you can even think of much more complex problems, such as faces: can you describe a random process whose realizations would be faces? In that case you have something which is totally non-stationary, totally non-ergodic.

Now, the reason why we can ask these questions nowadays is because of these deep neural networks, which seem
to be able to do such a thing. So let me remind you what these deep networks are. I'll take the example of an image: you want to do image classification, so x is your image, which you input into the network. What the network is basically doing, in particular for convolutional neural networks, is a cascade of convolutions. You have a convolution operator which has a very small support, typically a 3 by 3 or 5 by 5 convolution, and the output is transformed by a nonlinearity, a rectifier, which keeps a coefficient whenever it is above zero and sets to zero any coefficient which is negative. You do that with several filters, which produce a series of images; this is the first layer of the network. Then on each of these images you again apply a filter, you sum up the results, and you produce an output which is one of the images in the second layer. You do that for different families of filters, which produces a whole series of images that you subsample in the next layer. Then you repeat: each time you define a family of filters, do your convolutions, sum up, apply the nonlinearity, and go to the next layer, until the final layer. In the final layer you simply do a linear combination, and you hopefully get an approximation of the function f(x) that you would like to approximate.

Now, how do you train this network, which typically has hundreds of millions of parameters, the parameters of all your convolutional kernels? You update them in order to get the best possible approximation of the true function f(x) on the training data, and for that you have an optimization algorithm. What has been extremely surprising since, let's say, 2010 onward is the fact that this kind of technique, this kind of machine, has remarkable approximation capabilities over a very wide range of applications. Everybody has heard, of course, of image classification, but it goes much beyond
that. For sounds, speech recognition, what you have in your telephones now, is based on such techniques. Translation and analysis of text are done with such elements, as are regression in physics, computations in quantum chemistry, signal and image generation. Basically, when you begin to have enough data, a very large amount of training data, it looks like this kind of system is able to reach the state of the art, and we essentially don't understand why. There is something very mysterious, because you have a single type of architecture which is able to approximate these very different classes of problems, which indicates, from a mathematical point of view, that these problems share the same kind of regularity, so that the same kind of algorithm can approximate them. So one of the questions is to understand what all these problems have in common: what kind of regularity allows this kind of machine to approximate these functions despite the curse of dimensionality, despite the fact that we know that in high dimension, to approximate a function, the number of data samples normally explodes exponentially with the dimension.

You can take many different points of view to analyze that. A lot of work is devoted to the optimization side: how come you can optimize such algorithms with a stochastic gradient descent? That's not the kind of question I'll be asking here. I'll be trying to understand why this kind of architecture can approximate wide classes of functions, and what that says about the underlying regularity which is needed in order to be approximated. Basically it's going to be a harmonic analysis point of view: I'll analyze these networks as a kind of harmonic analysis machine. So the questions are: what kind of regularity do these functions have in order to be approximated, and why can this kind of computational architecture approximate such functions? What is really done inside these structures? What is the learning providing? I'll be showing that there are several key properties that are coming
up. One is the fact that scales are separated, and basically you build a kind of multiscale representation. If you think of it, the depth that you have here is a kind of scale axis. Why? Because you first aggregate the information at a very fine scale, over a small neighborhood, and then, because you cascade these aggregations and subsamplings, you progressively aggregate the information over wider and wider scales. Then deformations: regularity to diffeomorphisms is also at the core of this kind of thing. It's at the core of physics, and we'll see that it also appears in the case of image classification. Then sparsity: when you look at the coefficients in this type of network, many of them are zero because of the rectifier. That indicates that there is some kind of sparsity property, and we'll see that it is fundamental in order to understand the kind of regularity that is coming out.

So in order to look at the problem, I'm going to look at three different types of problems. The first one is classification. The second
one, and in fact I'll begin with this one as it is a slightly simpler but still extremely difficult problem, is the modeling of totally non-Gaussian but ergodic random processes such as turbulence. Why this one? Because people have been doing experiments to try to model turbulent fluids such as the ones you see here. This one is from astrophysics, this is a two-dimensional vorticity field, and these are bubbles, and the images that you see below are syntheses from such networks.

So how do they do that? They first train the network on a database of images, in particular one called ImageNet, which has nothing to do with these particular images. Once the network is trained, you take the image, this one example, and you compute all the coefficients of this image in the network, within the different layers. Now you look at one layer, and you compute the correlation of two images within that layer, in different channels, so you get a correlation coefficient. You do that for any pair of channels. Then you have a statistical description of your random process through these correlations. Now you generate a new realization of your random process by beginning from white noise and modifying it up to the point that it has the same correlation properties. Then you look at the realization, and it looks like these things: it looks like realizations of turbulence, bubbles, and so on.

So the question is why. What is happening here? It is reproducing something that looks like a turbulent fluid. That was an astrophysical image; it reproduces an image which is totally different but looks like turbulence. This one was computed from this one. For any input white noise, you do the optimization and you get a different image, like this one. So you get a random process which seems to have similar properties. Question: why? What happened? What kind of model did you build?

Another type of modeling of random processes, for totally non-ergodic processes: what do they do? They take an image, they use such a network
in order to have an output which looks like a Gaussian white noise. Then they invert the network, so that from this Gaussian white noise they recover something that looks like the original image. That's called an autoencoder. They train that, let's say, on faces or bedrooms, and then they synthesize new images by creating a new Gaussian white noise and applying this decoder. If you train that on bedrooms, for example, you have a database of bedrooms; you put in a new white noise and you get a new bedroom, another white noise and you get another bedroom. Now the most surprising thing is when you do a linear interpolation between these two white noises, just a linear interpolation. Any of these interpolations, you plug it in here, you reconstruct, and what you get is a kind of new bedroom, and at every stage it looks like a bedroom. What does that mean? It means that in that space you have a representation of bedrooms which is now a totally flat manifold, because the average of two bedrooms is a bedroom. So you have completely flattened out the points representing bedrooms, which form a wild set of points in the original space, into something flat. What's happening? Why?

So these will be the questions organizing the talk. The first part will be about the modeling of random processes and these ideas of scale separation. What I'll be showing is that scale separation is at the core of the ability to reduce this curse of dimensionality, and that one of the very difficult problems in this field, in math, is to understand interactions across scales; I'll try to show why nonlinearities in these systems provide these interactions across scales. The second topic will be in relation to regularity to the action of diffeomorphisms. There we will look at problems such as classification and regression of energies in quantum chemistry, and we'll see the kind of role that this plays and, again, what kind of math comes out. The last one will be about the modeling of these random
processes such as these bedrooms. What we'll see is that in some sense it looks like these networks build a kind of memory, and the notion of sparsity is something that will be important. So these will be the three stages of the talk.

Let me begin with scale. Why is scale separation so important? This is very well known in physics. When you have an N-body problem, so a priori long-range interactions, all the bodies are interacting. You can think of your bodies as particles, but they could be pixels, or agents in a social network. How can you summarize and reduce the interactions, let's say with a central particle here? There are the very strong interactions with the neighbors, OK, your family, the neighboring particles or pixels. Then, for particles farther away, instead of looking at the interaction of each particle with this one, you can aggregate these interactions, construct the equivalent field, and look at the interaction of the group with this particle. With even more distant particles, and these are called multipole methods, you can regroup even larger numbers of particles and summarize the interaction with a single term.

For example, think of the social network: we are six billion inhabitants on the earth, and you cannot neglect even people who are very far away, let's say someone living somewhere in China, because if you neglect each Chinese person then you neglect China, and if China has, let's say, some particular tension with France or with your country, that can have an effect on your life. But you don't need to look at each Chinese person, only at the aggregate, China. So this idea of multiscale aggregation allows you to reduce the interactions to about log d components. What is very difficult is to understand the interactions of the groups, in other words the interactions across scales. What has been well understood for a long time is how to do scale separation, and wavelets are the best tools to
do that. What is essentially not understood since the 1970s is how to model and capture scale interactions, and what I'll be trying to show you is that this is completely central within these networks.

OK, so how do you build scale separation? The way you do that is by introducing small waves, which are wavelets, which basically look like a Gaussian modulated by a cosine or a sine wave. What you do is scale these wavelets like that, and in two dimensions you rotate them, so you get a wavelet for any angle and dilation. Then you take your data and you explode it along different scales and rotations by doing a convolution, like in these networks. How does it look in the Fourier domain? A convolution is a product, so basically you are going to filter your Fourier transform into a
channel like that. When you change the angle of the wavelet, you basically rotate the Fourier support; when you dilate the wavelet, you dilate the Fourier support. So you explode the information, if you look at it in the Fourier domain, into different frequency channels.

Now, if you want to model a random process through, let's say, correlations, what you will observe is that the wavelet coefficients at two different scales or angles are not correlated if the random process is stationary. Why? Because they live in two different frequency channels, and a simple calculation shows that because the supports of the wavelets in Fourier are separated, the correlation is zero.

Let me look at an example. This is an image; these are the wavelet coefficients at the first scale, basically gray for zero, white for positive, black for negative. You have large coefficients near the edges. This is the average. Then you compute the next-scale wavelet coefficients, and the next scale. What you see is that most coefficients are very small, nearly zero, but they look very much alike across scales. Yet they are not correlated, so you are unable to capture the dependence across scales with just a simple linear operator such as a correlation.

In statistical physics, how are you going to model a random process? The standard way is to compute moments. What is a moment? It's the expected value of some transformation Phi_m of your random field. You compute the expected value, and then you define a probability distribution which satisfies these moments and which has maximum entropy, which is a way to express that you have no further information: you look at all possible configurations having those moments. What you can rather easily show is that you then get a Gibbs distribution, and this Gibbs distribution is defined by Lagrange multipliers which are adjusted in order to satisfy these moments. Now, what people have mostly been doing until now is to compute moments which are basically correlation
moments, and that's exactly what the Kolmogorov model of turbulence is about. If you do that, then what you are going to get here is a bilinear function, and therefore you are going to get a Gaussian distribution. So if you look at a Gaussian model of turbulence, that's what you're going to get: the images below have exactly the same second-order moments as the ones above. They are the maximum entropy models constrained by second-order moments. Same spectrum, but you've lost all the geometry of the structures.

So what have people been trying to do in statistics? Go to high-order moments. But if you go to high-order moments, you have many moments, they have a huge variance, and the estimators are in fact very bad because of the variance. Deep networks seem to give estimators which look much better. Why? The key point here will be the nonlinearity. What I want to show is that the nonlinearity is what builds the relation across scales, and the key way you are going to relate scales is through phase; that is the link between scales.

Take a wavelet which has a certain phase alpha, I'll call it that way. I'm going to build a network where I impose that the filters are wavelets. So I take my x, I filter it with a wavelet, and I apply a rectifier. What's happening if you do that? Let's look at this convolution. I convolve x with a wavelet which has a certain phase; I can take out the modulus of the convolution, and I'm left with a cosine which depends upon the phase of the wavelet and the phase of the convolution. Now what happens when you apply a rectifier? The rectifier is a homogeneous operator, so you can pull the modulus out; the rectifier only transforms the phase, essentially by killing the negative coefficients. You can view the rectifier as a window on the phase: it eliminates all the phases which correspond to negative coefficients and keeps the phases corresponding to positive coefficients. Now, what if you
now do a Fourier transform relative to this phase variable alpha? What you see is that after applying your rectifier, if I take my coefficients and do a Fourier transform relative to the phase variable, I see appearing the modulus of the output of the filtering, and I see appearing the phase, but each phase is multiplied by a harmonic index k. So you do something very nonlinear, which is that you create all kinds of harmonics of your phase.

Now why is that fundamental if you want to model random processes? You see, if you take a convolution, and I write it that way, and you take its phase factor to the power k, what you are essentially going to do is move the Fourier support. Suppose that I look, in one dimension, at a random process, and I look at its components on two different frequency intervals, because I have two wavelets which live at different frequencies. These two components are not correlated, because the Fourier components don't interact. If you apply a harmonic, this frequency is going to move: for k = 2 it moves here, to 2 lambda, for k = 3 here. Now if you look at these two components, they are correlated. So after applying your nonlinearity you create correlations, because you move your Fourier support.

What does that mean? That means that if you look over your domain, where you've separated all the phases and all the directions, after applying the rectifier, which you can view in that domain or in the Fourier domain, which amounts to computing the harmonics, all these blobs are going to be correlated. You can correlate the coefficients within a given scale by just computing a standard correlation; you can compute the correlation across two orientations by using the appropriate exponent, which in that case is k = 0, the modulus; and you can compute the correlations across two different bands. If you look at that, this is very close to the calculations of the renormalization group. The renormalization group is what allows you to compute, in the particular
case of the Ising model, what kind of random process you are going to have. And how do you do it? You do it by looking at the interactions of the different scales. Numerically, what do you have? These are examples of random processes. I'm now going to compute a maximum entropy process, not conditioned on correlations but conditioned on these nonlinear harmonic correlations, these ones, and that's what you get. In the case of Ising at critical scale, you can reproduce realizations of Ising. For turbulence, you produce realizations of random processes, which are here. Contrary to the Gaussian case, you now see that the geometrical structures appear, because you've restored the alignment of the phases. And now one of the very beautiful questions is: can we extend the calculation of
renormalization groups, which we know how to do on Ising, to much more complex processes such as, for example, turbulence, in order to better understand the kind of properties these random processes have? That's work that is being done with people at ENS, in astrophysics in particular.
Okay, let me now move to the second problem. The second problem is about classification. So you want to classify, for example, digits. One of the properties that you see in classification is that when your digit moves, is deformed, if the deformation is not too big, it will typically belong to the same class: a three stays a three, a five stays a five. And if you take, let's say, paintings, if you move on your diffeomorphism group, as long as the diffeomorphism is not too big you'll basically recognize the same painting; then it will be another painting. And if you move like that on the diffeomorphism group, you can go across essentially all the European paintings that you may find in the Louvre, in particular.
Okay, so diffeomorphism is a key element of regularity. If you want to approximate a function which is regular to diffeomorphisms, you want to build descriptors which are regular to the action of diffeomorphisms. How can you do that? X is a function in L2. If you deform it, the distance with X is going to be very large. How do you build regularity to diffeomorphisms? A very simple way is just to average X. You average it, let's say with a Gaussian, and you are going to get a descriptor which becomes very regular to the action of diffeomorphisms, as long as the deformation is not too big relative to the averaging scale. The problem is that if you do that, you lose information, because you've been averaging. So how can you recover the information which was lost? The information which was lost is the high frequencies, which you can capture with wavelets. But if you then average, you are going to get zero, because these are oscillating functions. How can you get a nonzero coefficient?
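The cancellation just described can be checked numerically. The following is a minimal sketch, not the speaker's implementation: the white-noise test signal, the Gabor-type wavelet, the Gaussian window and every parameter value are assumptions chosen only for illustration. It compares what survives a Gaussian averaging when it is applied to raw wavelet coefficients and when it is applied after a positive nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
x = rng.standard_normal(n)            # toy 1D signal (white noise), an assumption

# Signed circular distance to index 0, so both filters are centred at the origin.
d = (np.arange(n) + n // 2) % n - n // 2

# Gaussian averaging window phi (low-pass), normalised to sum to 1.
phi = np.exp(-d**2 / (2 * 32.0**2))
phi /= phi.sum()

# Real Gabor-type wavelet psi (band-pass): an oscillation under a Gaussian envelope.
psi = np.cos(2 * np.pi * 0.125 * d) * np.exp(-d**2 / (2 * 8.0**2))
psi -= psi.mean()                      # enforce an exactly zero mean

def circ_conv(a, b):
    """Circular convolution computed with the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

w = circ_conv(x, psi)                           # oscillating wavelet coefficients
avg_linear = circ_conv(w, phi)                  # averaging them directly
avg_rect = circ_conv(np.maximum(w, 0.0), phi)   # averaging after a rectifier

# Fraction of the coefficient norm that survives the averaging in each case.
ratio_linear = np.linalg.norm(avg_linear) / np.linalg.norm(w)
ratio_rect = np.linalg.norm(avg_rect) / np.linalg.norm(w)
print(ratio_linear, ratio_rect)
```

With these settings the first ratio should be close to zero, because the window and the wavelet occupy almost disjoint frequency bands, while the second stays well away from zero.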
Apply the rectifier, which is positive, and these will be coefficients which are again going to be regular to the action of diffeomorphisms. But these wavelet coefficients, you average them, so you lost information by doing the averaging. How can you recover the information that you've lost? Well, you take these coefficients and you extract their high frequencies. How can you do that? Again with wavelets. And why are wavelets very natural here? Because what is a diffeomorphism? A diffeomorphism is a local deformation, a local dilation. If you want to be regular to the action of diffeomorphisms, you have to separate scales, and that's what the wavelets will do. And you get a new set of coefficients and an averaging. So that's going to look like a convolutional neural network, where you iterate: convolution with wavelets, nonlinearity, convolution with wavelets. But in this network I don't learn the filters, I impose the filters, because I have prior knowledge of the kind of regularity I want to produce.
One thing that you can prove is that if you build such a cascade, where you take your function X and you deform it, you're going to get a representation which is Lipschitz continuous to the action of diffeomorphisms. In what sense? If X is deformed, if you look at these coefficients as a vector, if you look at the Euclidean distance between the representations, let's say the output of your network before and after the deformation, the distance is going to be of the order of the size of the deformation. And the weak topology of a diffeomorphism is defined by the size of the Jacobian of the deformation, of the translation operator that depends upon space. And you can prove that you have something which is stable. So building something which is regular to deformations naturally leads you to scale separation, again, and to the use of these nonlinearities. The question is how good that will be compared to a deep network. So you have a kind of
network, but you haven't learned the filters. How good is that going to be compared to a network where you learn everything?
The first problem I'm going to look at here is quantum chemistry. Now quantum chemistry is an interesting example because you have prior information on the type of function you want to approximate. What do you know? The problem is the following: X, the state of the system, is described by a set of atom positions and charges, and you want the energy of, let's say, a molecule. You know that if you translate the molecule, it's not going to change; if you rotate the molecule, it's not going to change. So it's invariant to translation and rotation. If you slightly deform the molecule, the energy is just going to change slightly, so you have a regularity to the action of diffeomorphisms. Question: can I learn such a function from a database which gives me configurations of molecules and the value of the energy?
Now if you look in quantum chemistry, the way such energies are calculated is by using what is called DFT. The key idea is the following: you take a molecule and you look at the electronic density of the molecule. That's what I'm showing here: each gray level gives you the probability of occurrence of an electron at a given position. Okay, and they are very close to the atoms, but they're also in between two atoms, because that's the chemical bond, which is here. To compute such a thing requires solving the Schrödinger equation. In that framework, I suppose I don't know physics beyond these basic invariants, so I don't know the Schrödinger equation. What we are going to do, and there is now a whole community in physics and machine learning doing that kind of thing, is to represent the molecule just by the state. The only thing I know in X is the position of each atom and the number of electrons on each atom. So I'm going to naively represent the electronic density as if each electron were sitting exactly where the core, the nucleus, of the atom is. So you get a
kind of electronic density like that. Here I have no idea what chemistry is about. Then you build a learning system. You take your density in 3D and you compute a representation by separating all scales, separating all angles, applying a rectifier, and you get these kinds of images of 3D blobs, which look a little bit like orbitals. Then you apply your linear averaging, which builds a number of descriptors of the order of the log of the dimension, which are invariant to translations and rotations and stable to deformations. And where do you learn physics? In the last stage, where you just learn the weights of the linear combination of all these descriptors to try to approximate the true energy of the molecule. How do you learn these coefficients? Because you have a database
of examples, and you regress your coefficients on your database. Okay, there are databases which have been constructed to test that kind of thing. Typical size: about a hundred thirty thousand molecules. These are organic molecules. What people do is compute a deep network where they learn everything, and they compare the errors with the errors that a numerical algorithm would make with DFT. What people have been observing is that with that kind of technique you can get, if the database is rich enough, an error which is smaller than the error of a typical numerical scheme with DFT. In our case we don't learn the filters. We just say we know some regularity properties, so the math leads to a certain type of filters, and basically you get an error which is of the same order. So that shows that in that kind of example you basically know what is being learned. What is learned in the filters is the type of regularity, and you can therefore replace them with wavelets, and the only thing that you need to learn is basically the linear weights at the output. However, these are, let's say, very simple problems, in the sense that the kind of databases that have been used until now are databases of small molecules, about 30 atoms.
Yes? Sorry, yeah, when I was comparing here, these are the deep nets. What I call a deep net is when you learn everything. When you learn everything, you get an error of the order of 0.5 kilocalorie per mole. When you don't learn anything, you get an error of the same order, if you use wavelets which are adapted to the kind of transformations to which you want to be regular or invariant. In that kind of case the network is in fact smaller, because you know exactly what you want. Basically, in that case you know exactly the kind of filters, so you don't have to build a big structure. But again, these are not horribly complicated problems. And if you look in the world of images,
you can see the difference between simple and hard problems. What is a simple problem? Recognizing digits: you have an image, you have a digit, and you have to recognize what kind of digit it is. Or differentiating textures, which are uniform random processes. If you take a deep network and you learn everything, or if you impose the filters and you impose that the filters are wavelets, you get about the same kind of thing. If you then move to something much more complicated, such as that kind of images, ImageNet, if you impose filters which are wavelets, the error is going to be, in that case you have 1000 classes, about 50%. If you learn the filters, the error is much smaller. In 2012, that was the big result that began to attract very much attention, it was 20%; now it's about five percent.
So the question here is: what is learned? What is learned in these kinds of networks? And that's the last part. What I would like to show is that there is a simple mathematical model which can capture, at first order, what is learned: it is basically learning dictionaries to get sparse approximations. If you think of this domain, this domain was called pattern recognition before. What is a pattern? A pattern is a structure which approximates your signal and which is important for classification. How can you think of decomposing a signal in terms of patterns? X is my data. I'm going to define a dictionary of patterns: each column of my matrix is a particular pattern. Decomposing X as a sum of a limited number of patterns can be written as a product of this matrix with a sparse vector Z, which is mostly zero besides the few patterns that you are going to select to represent X. Now how can you express such a problem? This is a well-known field, called sparse representations, that has been studied since basically the 90s. One way you can specify this problem is by saying: well, X is going to be approximated by D multiplied by Z, and I want Z to be sparse, so I want to impose that the l1 norm of Z is small. So I
solve this optimization problem, and I'm also going to impose that the coefficients of Z here are positive, and here you have a convex minimization problem. Good. So how can you solve such a convex minimization problem? There are different types of algorithms to do that. Basically these are iterative algorithms. They amount to doing a gradient step: you have a convex problem, so a gradient step applied to the square norm term, which is going to lead to that kind of matrix, and you have your l1 term, which you're going to minimize by doing a nonlinear projection. It happens that in this case the nonlinear projector is exactly a rectifier. So basically, solving that kind of problem amounts to applying a linear operator and applying a rectifier with what is called here a bias, and the bias corresponds to the Lagrange multiplier. So to solve that, you essentially compute a deep network, a deep network where at each stage you are going to apply your matrix, which is here a linear matrix, and a rectifier, and you iterate. In this network the matrices are all the same: they only depend upon the one matrix, the dictionary.
Okay, now how can you use that for doing learning? You can use that for doing learning by saying: okay, I would like to extract the best patterns, the ones that are going to lead to the best classification. So I'm going to build a network where you compute your sparse coding, then I put a classifier and I get my classification result. And now what do you want to do? You want to compute the best dictionary, which is going to lead to the smallest error. So you are going to optimize the weights in the dictionary D and the classifier so that, over a given database of data and labels, the classification error, the loss, is as small as possible. And so you do a standard gradient descent. So this is standard neural network learning. The only thing is that you are doing something which is, from a math point of view, well understood: you are just doing a sparse approximation where you are learning your matrix D, and you can look at the
convergence of that kind of thing. Okay, so this is what wavelets are giving you. If you don't learn anything in your network, you are going to do a
cascade with predefined wavelets, compute your invariants, do your classification, and you essentially have 50% error. What you could think is: okay, let's replace the wavelet representation with a dictionary. So you learn a dictionary which is optimized in order to minimize the error, and you don't improve much, essentially you get the same kind of error. Now what if you cascade the two? You first compute your representation with your invariants, which are now regular to the action of diffeomorphisms, and then in that space you learn the dictionary. And there you have a big drop in error, which goes down to 18%, which is essentially better than what was obtained in 2012 with this famous AlexNet.
What does that mean? That means that there are really two elements. There is one element which is due to the geometry that you know, which is captured by translation, rotation, diffeomorphism; you want to essentially reduce, eliminate these variabilities. Once you've eliminated these variabilities, you can define a set of patterns, because otherwise you need to define a different pattern for any deformation, any translation, any rotation, and your dictionary gets absolutely huge. Now why you have such an error reduction is an open problem. What you observe is that in the output you have a kind of concentration phenomenon that we don't quite understand. But at that stage you can build a very much simpler model, which is basically a cascade of two well-understood operators.
So the last application I'm going to show you briefly is about these autoencoders. These autoencoders are able to basically synthesize random processes which are absolutely not ergodic, and I gave this example of bedrooms, where you also see these deformation regularity properties. One way to pose that problem is the following. You begin with a random process X. What this encoder is essentially doing is building a Gaussian white noise. Okay, so you have found a map which builds a Gaussian white noise. This map is invertible. I'm
going to impose that this map is bi-Lipschitz over the support of the random process. The third property: you want that kind of deformation property, so you want your map to be Lipschitz continuous to the action of diffeomorphisms. Questions: how to build such a map, and how to invert it?
So how to build such a map? We have constructed something which is regular to the action of diffeomorphisms just before: separating the scales and then doing a spatial average. Now why would that build something which is Gaussian? Because when you begin to average over a very large domain, you begin to mix your random variables, and if your random variables have a bit of decorrelation, if you average them over a very large domain, you have your central limit theorems, which tell you that it's going to converge to Gaussian random variables. So then you apply a linear operator which whitens your Gaussian, and you can hope to get a Gaussian random variable.
Now the question is the inversion: is this map going to be bi-Lipschitz, can you invert it, and how does it relate to the previous topic? So what is hard to invert here? What is hard to invert is not this first nonlinear part, because the nonlinearity is due to the rectifier, but if you have the rectifier of a and of minus a, which correspond to two different phases, you get back a, so that's easy to invert. The wavelet transform is invertible, so that's not a problem. What is hard to invert is this averaging. The averaging basically builds a Gaussian process by reducing the dimension and mixing all the random variables, and that's not an invertible operator. Now how can you invert a linear operator which loses information? You can if you have some prior information about the sparsity of your data, and that's called compressed sensing. So how do you do that kind of thing? Same idea: you need to learn a dictionary where your data is going to be sparse. So the idea is the following: in some dictionary, which you don't know for now, your representation is
going to be sparse, so it can be represented as D times a sparse vector Z. If that's the case, the white noise that you've obtained by applying the operator L is itself going to have a sparse representation: because X is equal to D Z, the dictionary is now L D. So what does that mean? That means that in order to invert your map, what you need is to compute the sparse code, the one which in this dictionary is going to specify this w, and then you can apply your dictionary D and you are going to recover X. How do you learn this dictionary? Well, you learn this dictionary by essentially taking your examples and each time trying to find the dictionary which is going to best reproduce the examples, and that's done by optimizing the dictionary in this neural network, again with a stochastic gradient descent.
So I'm going to show examples. These are examples of faces. Okay, the top images are the training examples on which you train your network and you optimize the dictionary, and then you reconstruct these images, that's what you see here. Then you try with new realizations of your random process. Okay, so you have new realizations, these are the testing images. You decompose them, now you use the dictionary that was computed from the first images, and you try to reconstruct, and indeed the reconstructed images look good. So that means that you have indeed inverted your bi-Lipschitz map. You can train your network on another database, that's typically what an autoencoder is doing: these are the training images, these are the reconstructed ones, this is what you do on the testing images, these are the reconstructed ones.
Now what happens if you take two white noises, from each of which you compute an image, and you do a linear interpolation in the noise domain? The noise domain in this case is essentially the domain of the scattering coefficients which have been averaged. Because they are again regular to the action of diffeomorphisms, when you do a linear interpolation and you
reconstruct an image, you see how you basically warp progressively one image into the other one. If you do that on a different image and another one, you see the same kind of thing. If you do that on a bedroom, you warp one image into the other image.
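The inversion and interpolation steps described above can be sketched on a toy problem. This is a hedged illustration only: the random dictionary D, the random operator L (standing in for the averaging and whitening operator), the sizes, and the helper names sparse_code, decode and random_sparse_signal are all assumptions introduced here, and a plain proximal-gradient iteration (a linear step followed by a rectifier with a bias) is used in place of the trained network of the talk.

```python
import numpy as np

def sparse_code(A, w, lam=0.01, n_iter=500):
    """Non-negative l1 sparse coding of w in the dictionary A.
    Each iteration is a linear step followed by a rectifier with a bias."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1 / Lipschitz constant of the gradient
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ z - w)                  # gradient of 0.5 * ||A z - w||^2
        z = np.maximum(z - step * grad - step * lam, 0.0)
    return z

def decode(D, L, w):
    """Hypothetical decoder: sparse code of w in the dictionary L D, then X = D Z."""
    return D @ sparse_code(L @ D, w)

rng = np.random.default_rng(0)
dim_x, dim_w, n_atoms = 64, 40, 128               # toy sizes, chosen arbitrarily
D = rng.standard_normal((dim_x, n_atoms))
D /= np.linalg.norm(D, axis=0)                    # unit-norm patterns (columns)
L = rng.standard_normal((dim_w, dim_x)) / np.sqrt(dim_w)   # stand-in for the operator L

def random_sparse_signal(k=2):
    """Toy signal that is exactly k-sparse and non-negative in D."""
    z = np.zeros(n_atoms)
    z[rng.choice(n_atoms, size=k, replace=False)] = rng.uniform(1.0, 2.0, size=k)
    return D @ z

x0, x1 = random_sparse_signal(), random_sparse_signal()
w0, w1 = L @ x0, L @ x1                           # the two codes in the "noise" domain

# Linear interpolation between the two codes, then reconstruction of each point.
ts = np.linspace(0.0, 1.0, 5)
path = np.stack([decode(D, L, (1 - t) * w0 + t * w1) for t in ts])
print(path.shape)
print(np.linalg.norm(path[0] - x0) / np.linalg.norm(x0))   # relative error at t = 0
```

The last line prints the relative reconstruction error at the first endpoint. How small it is depends on the sparsity level, on the number of measurements and on lam, so no particular value is claimed here.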
Synthesis. So now you've again represented your random process with your scattering coefficients, and you've learned the dictionary which represents your coefficients in a sparse way. Now the synthesis amounts to randomly producing a Gaussian white noise, throwing this white noise through this generator, and computing an X. So now you've defined a random process. Okay, this random process, for different realizations of the white noise, this is how they look. So you've produced a random generator of faces. That's what autoencoders are essentially doing. What I'm showing here is that you can do it just by learning a single matrix, which is this dictionary. If you train it on bedrooms, it doesn't look so good, because bedrooms are much more complicated, but you see that your realizations look a bit like geometric bedrooms; at least they don't look like faces. What you get here is that essentially you have totally different types of models of stochastic processes, where everything is captured within the dictionary D, and the random excitation defines the space of your, let's say, free variables, or the entropy of the random process. You can do things, for example,
and that's important in particular for simulations in physics: you may know that when you compute, for example in fluid dynamics, you may have a very coarse grid approximation, for example in climatology, and you'd like to get a model of what's happening at fine scale. So this is the input, the very coarse scale approximation, but you've learned your dictionary. Now you put in noise, and these are all the possible realizations which have the same low frequencies, and each realization corresponds to a different noise, to different phases. So we are in a world which is very unusual from a math point of view, but what I want to say is that there is a lot of math to be done here, it's not just algorithms.
Now if you come back to deep networks and the topic itself, essentially that kind of structure has the complexity of a Turing machine. A Turing machine is programmed by a program. Here the program is essentially these weights, and you have, let's say, hundreds of millions or billions of weights. Okay, now in a Turing machine, if you want to write code which has no bugs, what you do is structure your program; you don't build a program with millions of unstructured lines of code. Currently we are training these machines directly by training the millions of coefficients, which makes them extremely complicated. In some sense, what I'm showing here is that when you begin to understand the math, you can structure these machines, and you see different kinds of functions appearing. The first few layers, in that case, essentially correspond to reducing the geometry of the problem. Then you see phenomena appearing which are more related to sparsity, and there are probably many, many other phenomena that are absolutely not incorporated here. Now, this being said, the math problems remain very widely open. The kind of question you'd like to ask is again: what is the regularity class, what is the set of functions that can be approximated by such networks, what kind of approximation theorems, how many neurons
do you need to get an epsilon error? These questions are totally not understood. OK, well, thanks very much.
[Applause]
Yes? When you were showing the example of the reconstructed images, between the training set and the testing set, it seemed to me the test reconstructed images were much closer to the original than in the training.
So that means that I've inverted the two. That's exactly what it means, because normally it's the contrary. Yes, you are right, and it shows
that I inverted the two columns. Thanks. No, there's no miracle, which is too bad,
but there is no way you can do better.
Thank you very much. Wow, I'm very impressed.
Yes? Yeah, very interesting. For this last topic, of modeling the network in terms of wavelets and then sparsity: there is a kind of prediction for how much information there should be in a network, which comes from statistical learning theory. It relates the generalization of the network to the amount of information. It would be very interesting to see the amount of information in your sparsity [inaudible], because there is not much information in this dictionary, and so whether that would match up with the information bounds that you get currently.
I mean, if you are referring to the information bounds that people have been computing on the networks, they are essentially computed on the norm of the operators and are extremely crude.
Yeah, that's right. If you go back, yeah, there's this target that people want [inaudible], related to the control of the generalization, the testing error.
Yeah, so you're right, that could be. Right now we indeed have to see how to understand the action of this learning of a dictionary. The problem is much simpler than in a usual network, because the learning is aggregated in a single matrix, but because it's very nonlinear and you put it at each layer, it's indeed not so simple. So yeah, you're right, that's an interesting way to think about it.
Yes? At one point you explained that with the Gaussian approximation it is very difficult to capture higher correlations and scale phenomena, and at the very end you have this one matrix that produces something with really high correlations. So are these two points of view the same?
Okay, let me explain. The reason why you can compute correlations across scales, angles and so on is the nonlinearity, which essentially realigns; in some sense you can view it as a realignment of the phase. Now in this synthesis, what you have is, you have the
dictionaries which are here. So you can view these coefficients as exciting different patterns, and then the different [inaudible] which are here are essentially randomly selecting these patterns. But you can also view it as a realignment of phase. What is absolutely not there in the second one is the maximum entropy principle, the optimization. It's much more complicated in the second one, I mean in the standard statistical physics framework, because there is really no quantity that you are optimizing, you are not controlling anything. The second one is in fact very close to something which is used all over signal processing, which is autoregressive models. Think of it: how do you build a simple Gaussian approximation? You begin with your random process, you whiten the random process with an operator, and then you invert this operator and you reconstruct your random process. This is analogous to what is done by an autoregressive filter. This is the equivalent of an autoregressive filter, but in a totally nonlinear way, because you didn't begin with something Gaussian. That's one way to view these autoencoders.
What about applying these methods to learning processes on graphs, social networks for example?
That I don't know. So what do you mean, learning processes on... you mean on graphs? Sorry, I didn't understand. OK, so there are a number of groups who have been trying to do that. Basically, what you need is to reintroduce the notions of translation, deformation and scale on graphs. All these notions can be defined; it's just that the translation operator is much more complicated than before. You can do that by going to the spectral domain: you compute the spectrum of the graph with a Laplace-Beltrami operator, and from there you can do similar things. So people have been trying to do that. The math is more complex because the translation operator on a graph is much more complicated. Okay.
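The spectral route mentioned in this last answer can be illustrated on a small graph. The sketch below is an assumption-laden toy, not something shown in the talk: it uses the combinatorial graph Laplacian (degree matrix minus adjacency matrix) as the discrete stand-in for the Laplace-Beltrami operator, a ring graph chosen because its spectrum is known in closed form, and hypothetical helper names (graph_fourier, inverse_graph_fourier, spectral_filter).

```python
import numpy as np

n = 16
# Adjacency matrix of a ring graph: node i is linked to i-1 and i+1 (mod n).
A = np.zeros((n, n))
idx = np.arange(n)
A[idx, (idx + 1) % n] = 1.0
A[idx, (idx - 1) % n] = 1.0

# Combinatorial graph Laplacian: degree matrix minus adjacency matrix.
Lap = np.diag(A.sum(axis=1)) - A

# Spectrum of the graph: eigenvalues play the role of frequencies and the
# orthonormal eigenvectors play the role of the Fourier basis.
eigvals, U = np.linalg.eigh(Lap)

def graph_fourier(x):
    """Coefficients of x in the Laplacian eigenbasis."""
    return U.T @ x

def inverse_graph_fourier(x_hat):
    return U @ x_hat

def spectral_filter(x, h):
    """Filter x by multiplying its graph Fourier coefficients by h(eigenvalue)."""
    return inverse_graph_fourier(h(eigvals) * graph_fourier(x))

rng = np.random.default_rng(0)
x = rng.standard_normal(n)                                  # a signal on the nodes
x_smooth = spectral_filter(x, lambda lam: np.exp(-lam))     # low-pass (averaging-like) filter
print(np.round(eigvals, 3))
```

On the ring the eigenvalues are 2 - 2 cos(2 pi k / n), so the construction reduces to the usual Fourier analysis on a circle. On an irregular graph the same code runs unchanged, but translations no longer have a simple form, which is the difficulty mentioned in the answer.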