AI vs Information theory and learnability
Formal Metadata
Title: AI vs Information theory and learnability
Number of Parts: 5
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/53798
Transcript (English, auto-generated)
00:16
I'm giving the last talk of the day, so don't expect anything too deep.
00:24
I am going to talk about deep learning, but it will be a quiet talk. There will be two parts, because I expected a slot of 45 minutes to one hour, so I have prepared two topics.
00:40
One topic is the relation between artificial intelligence and information theory: what information theory can tell us about artificial intelligence. Then I will get into a more involved part about learnability.
01:02
The last topic, which is relatively new, is whether our own problem is learnable; that is a question we are all asking ourselves. So first, to continue the discussion we had after the previous talk: one
01:20
question that Stephen Hawking and others asked themselves is what will happen if artificial intelligence, which is based on electronic technology rather than biology and is therefore supposed to be faster, supersedes
01:44
humans. It could basically be the end of civilization, or even worse, the end of our jobs as researchers; that is the question. This is a very important topic, more so for the young people, because for people my age
02:04
it matters less. But the first thing that information theory tells us is that a system cannot evolve by itself into a more sophisticated,
02:22
more complex system. The entropy of the future of the system, since the future state is a function of the system itself, can only be smaller; the entropy can only decrease. Therefore you cannot expect that a program in a computer which
02:45
is just working on itself, evolving from itself, will do something better than what it was doing before. The only way to create an automaton which is more complex
03:02
than the original one is to introduce entropy from outside. That entropy could be randomness, or it could be data, a data set. This is basically what artificial intelligence is doing: it collects huge amounts of data and tries to convert them into a system that is better aligned with what it is expected to do on this data.
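In information-theoretic terms, the argument just sketched is the standard fact that a deterministic function of a random variable cannot have more entropy than the variable itself; this is my paraphrase, not a formula from the slides. If the future state is a deterministic function f of the current state X, then

$$
H\bigl(f(X)\bigr) \;\le\; H(X),
$$

so no new complexity appears without an external source Z (randomness or data), for which H(X, Z) = H(X) + H(Z) when Z is independent of X.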
03:22
From the point of view of information theory, it is as if you add randomness to your
03:41
original system. And there is an equivalent in the evolution of life. Evolution has worked from simple systems into more complex systems, ourselves included, but some animals
04:01
are more complex than ourselves, which is in fact very surprising. Birds are more complex than mammals; they do not do math, but they are more complex. In fact we do not know that they are not doing math. The mutations are mostly random, coming from cosmic radiation and also from
04:25
viruses; viruses are not random, but these are the mutations. And what happens, of course, is that evolution selects the best species. Here is the main problem: what does "best species" mean?
04:42
The best is whoever survives the competition, which may be hard to pin down, but you can see the result every day when you walk in the forest: the system we have arrived at is very complex. I think the human
05:06
is somewhere in there; in general we always put the human at the top of evolution, but there is no reason that it should be at the top. So is it possible to apply this model, the model of life evolution, to the
05:22
evolution of codes, basically to artificial intelligence: how can codes evolve into something more complex, more useful? Is it possible to adapt this evolution of life to
05:41
code generation? First, as we said, the criterion used for the selection of species is not directly usable, because it is vague. Every time, we learn that a species evolved because of something completely impossible to imagine in advance: that the proboscis grew longer because
06:04
of the flowers, for example, in the case of butterflies. The butterflies with a longer proboscis survive because the flowers start to be deeper, and the butterflies with a short proboscis cannot survive;
06:22
but then the flowers become deeper and deeper, and in the end we do not understand very well the usefulness of the whole project. So, can we create a digital ecosystem where we specify
06:40
rules for the codes to compete with each other? Another question: is it possible to have a large enough ecosystem? If you have only two or three codes competing, it will not be very useful; you need something large enough. Another problem: how can we verify, and verify here means certify? Imagine that all these programs
07:04
generated randomly are used to control your car or the airplane you are traveling in; you would like, at a minimum, for them to be certified.
07:20
Recently a Google car ran over a woman because the case of that woman not being where she was expected was not in the program; for the program it was something that did not exist, so it just drove on. But now I would like to compare the power of life
07:47
with the power of our computers. The rise of artificial intelligence happened because we have had the application of Moore's law for three decades, and when I say three decades,
08:04
we should say four. So we have arrived at tremendous computational power. All the ideas about deep learning and neural networks are from the 60s. But in the 60s there was less computational power in a computer than in this,
08:28
you know, this is already too sophisticated, than in this key fob, even without the automatic part. So it was meaningless to imagine we could do
08:41
deep learning with basically a screwdriver. But now it is possible. So let's compare with life. Everybody has tried this; I tried it when I was young. You take a liter of sea water, you have a microscope, and you look at
09:03
small animals, small shrimp, things like that; it is like an ecosystem. This is one liter. And you can imagine that every day all the shrimps and
09:21
the bacteria (the bacteria are the main source of mutation) generate one kilobyte of new mutated code per liter per day. That is an assumption. Why this assumption? In fact, it turns out to be true. When I gave the first version of this
09:46
talk it was just a guess, and in fact it was later checked to be true. So you have one kilobyte of code every day per liter, and life appeared roughly
10:02
1,000 billion days ago. The volume of the ocean is about 1 billion cubic kilometers, and every cubic kilometer is 1,000 billion liters. I checked it again; the math is correct.
10:22
Therefore, the space of code explored by life on Earth is of the order of 10 to the 36 kilobytes, and this is a lot.
10:43
I can tell you it is really a lot. Imagine that you have a magic computer that can store one byte per atom: 10 to the 36 kilobytes would need a mountain the size of Everest,
11:02
at one byte per atom. If you look at the quantity of information created by mankind (this talk is two years old, so of course you have to multiply by 100), it is of the order of less than, as we said, a mole of information. A mole of information would be
11:25
10 to the 33; here it is 10 to the 18. In our model this is equivalent to storing all this information in a sand grain with a mass of 10 to the minus 8 kilograms. And note that
11:43
most of the information created by mankind is streaming, videos, and of course books and encyclopedias. If you look only at code, working code, executable code,
12:04
it is much less. Executable code gets replicated many times, but the amount of original executable lines of code created by computer scientists is about 100 gigabytes, which in the same model would be less than a nano-drop of water, not a drop, a nano-drop of
12:25
water. Because I was expecting to speak for one hour and did not have enough material otherwise, this is a digression that has nothing to do with the main discussion.
12:43
But anyhow, it sheds light on the complexity of life. Compare with the human being: a human has DNA equivalent to about six gigabits
13:02
of information, and when a new baby arrives (this is the proud grandpa part) the baby basically gets 20,000 coding genes, each
13:24
coming either from the mother or from the father. Therefore, since there are only two possibilities per gene, father or mother, the complexity added to mankind by a baby is about 20 kilobits. It is less
13:50
than the collective information contained in the SMS or the tweets you used to announce the birth of the kid. Don't tell that to your children; I did, and it was a mistake.
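A quick check of the arithmetic behind that remark, using the figures quoted in the talk:

$$
20\,000 \ \text{coding genes} \times \log_2 2 \ \text{bit per gene (mother or father)} \;=\; 20\,000 \ \text{bits} \;=\; 20\ \text{kbit}.
$$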
14:11
If you take a plant, something strange happens: a plant has DNA which is larger than the DNA of mammals, around 20 gigabits. Nobody knows the reason, of course, but a
14:27
potential reason is the fact that changes of climate do occur on Earth. Animals can move, and plants by definition cannot. Therefore
14:40
they have to find the right gene, one gene against warming or one gene against freezing, and activate it. So every time there is a species extinction, you remove 20 gigabits from the complexity of life, and every time there is a birth, you add only 20 kilobits.
15:04
So be careful when you walk; don't trample the grass without thinking, it is not something immune to extinction. The temporary conclusion
15:21
is that this is good news for us: at least we cannot expect, even by 2030, that artificial intelligence will be able to sever its ties with mankind.
15:41
Not yet. To do math, yes, that I do believe, but to evolve independently is not expected. Every time you see an advertisement about the miracle of artificial intelligence, you should have a small thought for all the engineers, mathematicians and computer scientists who worked
16:04
to make it happen, and that is not only the training of a neural network; there are a lot of things to do in order to get a result. The bad news is that if you want to get rid
16:25
of our civilization, of course, there are less complex ways to succeed. Ah yes, let me just check the time; not too late. So in case I run too long, I still
16:42
have a little more: this is what, let's say, Shannon would tell Turing, since we consider that Turing invented artificial intelligence: "oh yes, your thing is very great, but there are limitations, be careful."
17:03
In fact that is not quite fair, because Shannon also worked on artificial intelligence: he designed one of the first machines to solve maze escape, with a mechanical mouse. For a while I thought it was a real mouse, because in the picture it was not clear.
17:25
He also designed a mind-reading machine. The mind-reading machine basically takes advantage of the fact that the human brain is not a good source of randomness, in order to predict your next move. It works very well. Of course, it is an illustration.
17:47
Ah, here is an exercise: if you think the talk is too boring, you can think about this small exercise instead. It is what Turing might have replied to Shannon. Imagine
18:05
the problem; you can think about it by yourself during the rest of the talk. You have a monastery on top of Everest and there are monks. They have been living there forever. Their everyday job is to predict the weather of tomorrow, via artificial intelligence.
18:25
The constraint is that God accepts only a finite number of mistakes. Since there is an infinite number of days before doomsday, God is counting, and if the number of mistakes is more than finite, he won't be happy.
18:52
To simplify our lives we consider binary weather: zero for bad weather, one for nice weather. The only data set available to each monk, every day, is the
19:01
infinite sequence of past weather and predictions. The hint: the monks have to make a choice. You can think about it during the talk. Now I am going to move to more involved research, less philosophical; there will be fewer stories about the evolution of life and things like that.
19:27
Anyhow, I'm still going to talk about cats and dogs. The question is: is there a limit to the learnability of cats and dogs? Basically the symbol of artificial intelligence, its most visible success, is the ability to detect whether
19:44
there is a cat or a dog in a picture. Of course it is more than that, fortunately. So what I would like to do is ask my neural network,
20:01
my machine learning system, to discover an algorithm, a simple algorithm. Of course, I have no algorithm to detect whether there is a cat or a dog in a picture. In fact, what I just said is not exactly right, because a neural network is itself an algorithm,
20:24
so there is some nuance. But my question is: what would be the consequence if machine learning were incapable of discovering a simple algorithm? At first sight it is not a very interesting question,
20:41
because, as was said, it is completely pointless to use machine learning or artificial intelligence to mimic algorithms: if you have an algorithm, don't build machine learning or deep learning stuff around it, just use it. In fact, if the machine fails, it would be the only case where we take
21:05
a polynomial problem and end up with a non-polynomial one: if you ask a machine to discover a polynomial algorithm, it will find a non-polynomial solution, something which converges very slowly.
21:23
But in fact it is not that useless to ask the question, because if the machine is not able to discover an algorithm or to apply it, it means that on some problems
21:40
it will not be able to converge, because some problems require the data to be pre-processed by a specific algorithm. If the machine is not able to detect that you have to pre-process the data, you get poor convergence. If the machine were able to detect which algorithm
22:05
needs to be applied, and mimic that algorithm to get good convergence, you would have a universal solution, and we would be back to the beginning of the talk, with a universal machine that solves all problems. So the question is: are machine learning and artificial intelligence capable of discovering simple algorithms?
22:27
And in fact, the answer is no. There are problems where deep learning cannot converge
22:41
to a solution. The candidates are sorting, the Fourier transform, the classic example of wavelets, and convolution: you basically need to implement the convolution by hand, because a system without the convolution built in has to discover that many coefficients must be set to zero, and
23:01
it will not converge properly. Pattern matching is something it cannot do very well; tree and graph structures are a form of pattern matching. And for the parity function it has been proven that deep learning cannot converge to a
23:21
solution. The parity function: you have a sequence of n bits and the ground truth is one bit, for example the parity of the number of non-zero bits. If you take a random parity-type function,
23:40
then your system, however long you train it, will give an average error of one half. It means that it does not converge and does not even give a clue about the correct answer.
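A minimal sketch of the target in question, assuming the usual definition of parity over n bits; the experiment and the 50% figure are the speaker's claim, which this snippet only sets up and does not verify:

```python
import numpy as np

def parity(bits: np.ndarray) -> int:
    """Parity target: 1 if the number of non-zero bits is odd, else 0."""
    return int(bits.sum() % 2)

# Hypothetical data set for such an experiment: random n-bit strings and their parities.
n, m = 16, 10_000
X = np.random.randint(0, 2, size=(m, n))
y = np.array([parity(x) for x in X])
# The talk's claim is that a generic deep network trained by SGD on (X, y)
# stays near 50% error, i.e. it learns essentially nothing about this target.
```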
24:05
A first important point is the fact that a neural network is a Turing machine. There are some caveats: we have to be careful about questions of memory, so you need to consider a recurrent neural network. But basically this means that if you have an algorithm, then just by adjusting the weights in your
24:28
neural network you can implement that algorithm. But of course, machine learning is not programming, it is training, and for deep learning the training is done
24:44
via stochastic gradient descent. We already had a talk about this. So here is what I want to show you, and I will show you some cats and dogs, this is the nice part of my talk. Basically, you can see a neural network as a box
25:05
with weights, which are matrices; I will come to a more sophisticated description later. Viewed from outside, it looks like a coffee machine; this is of course a stylized picture. All the weights of the neural network
25:26
are in the machine. You show it a cat, it answers "cat", with a confidence, 100%, 90%. You show it another cat. Even this cat,
25:42
and you see this cat is very troubling, is still a cat. Then you have to challenge the system, so you show it a dog. Here I don't know what the result will be, but I assume it will say 50% cat, 50% dog.
26:07
Yes, there is no limit to human imagination when it comes to fooling the machine. But how does it work? Basically, you enter an image. An image is a sequence of numbers; it can be one million numbers.
26:25
Here I just put 10 numbers. The machine produces a prediction with the neural network inside, and knowing that the true result is 56, it computes the difference between
26:47
the ground truth and the prediction. During the training phase, it adjusts the coefficients inside the machine so that the prediction gets closer and closer to the ground truth. But every time, we change the training image. Therefore it is a
27:08
gradient descent, and we say stochastic gradient descent because the image we select each time is random. So it is stochastic.
27:20
It is a stochastic gradient descent machine. To simplify your life, you consider that you know the true result and you compute the gradient. We know exactly what the system is, a matrix and some activation function, something mathematically completely trivial,
27:41
so you can compute the gradient and then adjust the weights according to the gradient of the loss function: you move the weights in the negative gradient direction, so that you expect to reduce the error, which is the average error over the data.
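A minimal sketch of that update rule, on an assumed linear model with squared loss; the networks in the talk are of course deeper, and the names and learning rate here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))           # 1000 training "images" of 10 numbers each
true_w = rng.normal(size=10)
y = X @ true_w                            # ground truth produced by an assumed linear teacher

w = np.zeros(10)                          # the machine's adjustable coefficients
lr = 0.01                                 # learning rate

for step in range(10_000):
    i = rng.integers(len(X))              # stochastic: pick a random training example
    pred = X[i] @ w
    grad = 2 * (pred - y[i]) * X[i]       # gradient of the squared loss for this example
    w -= lr * grad                        # move against the gradient to reduce the error

print(np.mean((X @ w - y) ** 2))          # average squared error after training
```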
28:01
Here there are two examples. One is simply adjusting the weight vector along the gradient; this is the normal case, it is not so interesting. The big question I am going to ask myself
28:22
is basically: can we train a neural network to extract the maximum of two numbers? This is not a question of cats, not a question of dogs, not a question of superseding humanity. I just want two numbers, and I want to be able to extract
28:41
their maximum, something which is very simple. It turns out that there is an expression for the maximum of two numbers which is in fact a neural network: if you express it as a neural network, the activation function
29:05
is the ReLU, the one everybody uses. And it has a matrix representation: I apply a first matrix, keep the positive part of the resulting vector, and
29:21
apply another matrix, and this is equal to twice the maximum of the two numbers. Therefore the property holds: a neural network is a machine that can be trained, and it can represent the maximum of two numbers.
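A sketch of the identity presumably behind that slide (my own reconstruction; the exact matrices shown in the talk are not in the transcript), writing relu(x) = max(x, 0) applied componentwise:

$$
2\max(a,b) \;=\; (a+b) + |a-b|
\;=\; \mathrm{relu}(a-b) + \mathrm{relu}(b-a) + \mathrm{relu}(a+b) - \mathrm{relu}(-a-b),
$$

which is the two-layer network

$$
2\max(a,b) \;=\; W_2\,\mathrm{relu}\!\left(W_1 \begin{pmatrix} a\\ b\end{pmatrix}\right),
\qquad
W_1 = \begin{pmatrix} 1 & -1\\ -1 & 1\\ 1 & 1\\ -1 & -1 \end{pmatrix},
\qquad
W_2 = \begin{pmatrix} 1 & 1 & 1 & -1 \end{pmatrix}.
$$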
29:45
With not too much effort, even at this time of the afternoon, you can see that you can extract the maximum of more than two numbers: four numbers, and here I separate the case of eight numbers. Basically you have a specific
30:05
block that extracts the maximum of two numbers, and a recursive way to extract the maximum of any number of inputs. Nothing special: you take log n layers, it is very simple.
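A runnable sketch of that recursive construction (my own illustration, assuming the pairwise ReLU block above; duplicating the last element handles lengths that are not powers of two):

```python
import math
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

# Pairwise block: a two-layer ReLU network computing max(a, b) exactly.
W1 = np.array([[1, -1], [-1, 1], [1, 1], [-1, -1]], dtype=float)
W2 = np.array([1, 1, 1, -1], dtype=float) / 2.0      # divide by 2 to undo the factor 2

def max2(a, b):
    return float(W2 @ relu(W1 @ np.array([a, b])))

def max_n(xs):
    """Tournament of pairwise blocks: ceil(log2 n) layers of max2 units."""
    xs = list(map(float, xs))
    while len(xs) > 1:
        if len(xs) % 2:                               # odd length: carry the last element up
            xs.append(xs[-1])
        xs = [max2(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
    return xs[0]

vals = np.random.uniform(-1, 1, size=32)
print(max_n(vals), np.max(vals))                      # the two values should agree
print(math.ceil(math.log2(len(vals))), "layers of pairwise blocks")
```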
30:26
There are several ways to arrange this neural network: instead of always taking the pairwise maximum and separating what is above from what is below, you can mix things. And in fact
30:44
this is not innocent: there are many combinations of weights in the neural network that give the correct answer. Maybe that is a reason why we will see some problems. First, how good is gradient descent?
31:02
First, it is absolutely impossible, even for a simple problem, to reach the optimal neural network. You will always end up in a local minimum: it is a gradient descent, so you go down and down, and at some point it stops because it is at a local minimum, which is not the global minimum. There are a lot of tricks to
31:25
shake up the system: you jump and then you go lower, but it is always the same story, you have very little chance of ending at the global minimum. Even worse,
31:47
since we are in huge dimension, you can have vicious saddle points: points which are very unstable, where you go one step here, then up, then there, and
32:01
you just cycle around the saddle point. That is the theory; in practice nobody has seen it. If the local minima are close to each other, in some suitable definition of closeness,
32:21
it's good. If the local minima are far from each other, this is bad. Very close, good; far from each other, very bad. So the question is: how can training reach a good weight vector, a good local minimum?
32:44
So now, about the weight vector, let us play the following game. I will take a neural network, but I am not going to use numbers and I am not going to use cats.
33:01
You know, if you have a small kitten and you put it in front of a mirror for the first time, when it sees its own image it panics. I do the same with a neural network: I put a neural network face to face with another neural network, which is known; it is an adversarial setup.
33:25
The case is simple: I assume that the ground truth is given by another neural network, and I select the weights of this ground-truth neural network at random and never touch them. As I said, there are many candidates:
33:49
for the roots of the loss function you have many competitors, because many permutations of rows and columns are possible candidates for being the optimal neural network.
34:03
Therefore the loss function has a number of roots which is factorial in n, which is big, because this n is of the order of one million. So here is the exercise: you take an aquarium of large dimension, and this will be the place where you
34:26
select the roots of your loss function. In fact the roots of the loss function are correlated, but you assume they are not correlated at all, which is sufficient for the result.
34:40
And you define a simple loss function, one which has nothing to do with the actual deep-learning gradient descent, for example the product on the slide; you could take something more complicated, but the math is very easy on this one. And you prove that there is basically a quasi-black-hole effect: the local minimum is not a global minimum,
35:06
but something like the centroid of the global minima. Each of these points is a global minimum, but the local minimum is roughly their centroid, and the descent always converges to this local minimum,
35:20
which is not a good local minimum, but it always converges to this one. I did some experiments and it works, yes. It is all very interesting: when you have fewer than four global minima, in dimension 10,
35:45
it works very well; gradient descent always converges to a global minimum. But if you take 10 of them, for example, then it converges to the centroid.
36:02
Or to something which looks like the centroid; it is not exactly the centroid, it is something more complicated, but asymptotically, when the dimension increases, it tends to the centroid. [In answer to a question:] No, this is for any kind of network; we will look afterwards at how it applies to the maximum.
36:25
It is not specific to the maximum; it is a more general argument for any training of a neural network whose ground truth is given by another neural network
36:41
whose coefficients are selected randomly. And this is an element of proof which shows that, with high probability, the gradient descent converges not to the origin but to the average, the centroid of the roots, and that this centroid, ah,
37:03
look at that, it worked. The point is to prove that the loss function is convex with high probability around this centroid; therefore, once you enter the neighbourhood of the centroid, you never escape.
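In symbols, the claim as I understand it (a paraphrase; the precise hypotheses are on the slides rather than in the transcript): if the loss L has its global minima at roots r_1, ..., r_K drawn at random in dimension n, and c denotes their centroid, then

$$
c \;=\; \frac{1}{K}\sum_{k=1}^{K} r_k,
\qquad
\Pr\Big[\nabla^2 L(w) \succeq 0 \ \text{ for all } w \text{ in a neighbourhood of } c\Big] \;\longrightarrow\; 1
\quad (n \to \infty),
$$

so a gradient descent trajectory that wanders into this convex basin stays there, near c, rather than at any of the r_k.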
37:25
In fact this proof works quite generally, even if the roots are correlated, which is the case, and for many metrics, but you have to be careful: in this argument the loss function is smooth.
37:43
The real loss function is not smooth; it is something very complicated. Therefore we cannot say that the result holds for the real gradient descent on a real neural network, but let us assume that it does.
38:02
It tells you the following. If you converge to the centroid and the centroid is not zero, then by the law of large numbers you will make an error, but the error will fade away compared to the
38:30
norm of the centroid. I mean that if I take a test vector x and apply the real neural network to it, the result will be close to
38:47
what I get with the centroid neural network, up to a factor of one over the square root of n: the prediction itself is of order n
39:00
while the deviation is of order the square root of n, so the relative error is of order one over square root of n, n being the dimension; you are very close, basically. The problem happens if the centroid is zero, or very close to zero; this is bad news.
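Roughly, the law-of-large-numbers argument being invoked (again my paraphrase of the slide), writing f_W for the ground-truth network and f_c for the network built from the centroid weights:

$$
\frac{\bigl|f_c(x) - f_W(x)\bigr|}{\bigl|f_W(x)\bigr|} \;=\; O\!\left(\tfrac{1}{\sqrt{n}}\right) \ \text{ when } c \neq 0,
\qquad\text{whereas}\qquad
f_c(x) \approx 0 \ \text{ when } c \approx 0,
$$

so in the zero-mean case the relative error no longer vanishes as the dimension n grows.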
39:29
It is striking: in that case the relative error you make, compared with the exact prediction, becomes overwhelming; in fact it is essentially infinite. And
39:47
if you look at the maximum: take the average of the coefficients of its matrices and you see that the average coefficient is zero. So you expect max-finding to be a special case where
40:09
the centroid is zero, and therefore you expect poor results. It turns out that many algorithms on signed numbers have this property,
40:24
which we call the zero-mean weight property. Here is the training on the max: I trained nine randomly initialized neural networks, and their position in the
40:46
plane is the vector of their predictions on two test vectors that I keep fixed. The red arrow is the ground truth; it works well for two numbers.
41:05
So two numbers converge well. Four numbers converge, but less well. And then at 8, 16,
41:22
32, this effect, which we assume to be the effect of the zero-mean-weight property, starts to be a problem. To compare, we tried a zero-mean-weight random neural network as the target,
41:43
not the maximum network, and it turns out that the training goes very badly. In comparison, if you take a non-zero-mean-weight target function, it converges, because then it is
42:01
basically the law of large numbers, nothing more; it works very well.
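A hedged sketch of the kind of experiment just described (my own toy reconstruction, not the speaker's code, with illustrative sizes and learning rate): a student one-hidden-layer ReLU network is trained by SGD to imitate a frozen random teacher of the same shape, once with zero-mean teacher weights and once with shifted, non-zero-mean weights. The talk reports that the zero-mean case trains much worse; this snippet only sets the comparison up.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def make_net(rng, n_in, n_hid, shift=0.0):
    """Random 1-hidden-layer ReLU net; `shift` moves the mean of the weights."""
    return {"W1": rng.normal(size=(n_hid, n_in)) + shift,
            "w2": rng.normal(size=n_hid) + shift}

def forward(net, x):
    h = relu(net["W1"] @ x)
    return net["w2"] @ h, h

def train_student(teacher, rng, n_in=8, n_hid=32, steps=50_000, lr=1e-3):
    student = make_net(rng, n_in, n_hid)
    for _ in range(steps):
        x = rng.uniform(-1, 1, size=n_in)        # random training input
        target, _ = forward(teacher, x)          # frozen teacher gives the ground truth
        pred, h = forward(student, x)
        d = 2.0 * (pred - target)                # gradient of squared loss w.r.t. pred
        grad_w2 = d * h
        grad_W1 = np.outer(d * student["w2"] * (h > 0), x)
        student["w2"] -= lr * grad_w2            # plain SGD step
        student["W1"] -= lr * grad_W1
    # test error, relative to the teacher's output scale
    X = rng.uniform(-1, 1, size=(2000, n_in))
    t = np.array([forward(teacher, x)[0] for x in X])
    p = np.array([forward(student, x)[0] for x in X])
    return np.mean((p - t) ** 2) / np.mean(t ** 2)

rng = np.random.default_rng(0)
zero_mean_teacher = make_net(rng, 8, 32, shift=0.0)   # average weight ~ 0, like the max
shifted_teacher = make_net(rng, 8, 32, shift=1.0)     # non-zero-mean weights
print("zero-mean teacher, relative error:", train_student(zero_mean_teacher, rng))
print("shifted teacher,   relative error:", train_student(shifted_teacher, rng))
```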
42:23
So if this is true (and as I said, I only used a toy model to show it quickly), it means that there is a swamp area in learning: if you have a problem and your gradient descent goes into the area that I call the swamp area, where the coefficients have zero weight (the weight can be the mean weight, the
42:43
third-moment weight, anything like that), then it does not converge, it can get blocked, and the convergence will be very poor. And if the optimal solution,
43:03
the target network, is in this area, then you can expect that it will not converge. It means that your system will not be able to find the maximum of several numbers.
43:23
Now, as a conclusion, there is a kind of equivalence; this is what we learn about learnability. There is a kind of equivalence between programming and learning. We know from Turing that program termination is undecidable in general,
43:43
so we can expect that convergence of learning may also be undecidable in general. There are some propositions in this area. For instance, it is easy to state that a program does not terminate,
44:05
but here the extra difficulty is how to define a bad convergence, what it means not to converge well; that is another difficulty. But fortunately we know that we can prove that some programs terminate: for example, the program controlling your plane or your car
44:22
is proven to terminate, sometimes. Therefore, can we hope to have some way, at least in special cases, to detect that a problem will not converge well? And is it possible to learn,
44:44
to train a neural network to detect which algorithm you should use in order to escape the swamp area? There is also the question that I was asking, which Stefan started to look at:
45:03
can we train a neural network to detect which physical laws apply to a set of physical measurements? You have some measurements in astrophysics, but you do not know which sequence of physical laws to apply, or whether there is a new physical law.
45:24
This is a very interesting question, because if we consider that a physical law is like a basic algorithm, just with some variety, it may also be difficult to
45:41
find the basic physical law. But this is an open question. So that is all I wanted to show you. If you want to know the solution of the problem for the monks, there is one slide after this, but I think you know the answer already,
46:06
so I'm not going to go into that slide.
46:23
[Question from the audience:] In your comparison between life, evolution, and AI and machine learning: as you said, life has no goal, no purpose of evolution, as stated by Darwin.
46:42
But AI surely has one goal; you tell your machine what you want it to achieve, right? Would that make a huge difference? Because you said life, or evolution, has done a lot of computation, basically, with no particular direction. So could that
47:07
make AI, you know, run faster? [Answer:] Yes, I agree, and the question I was asking was: can artificial intelligence get rid of human intervention? That was the basic question,
47:26
and you give the answer. If a human is able to give the rules and it works, then of course it is working. But if you give very vague rules, you will need more computation. And if you don't know which rules you have to give,
47:49
you know the famous joke about the command "do what I mean" on the computer: when you are debugging a program and in the end you are fed up, you type "do what I mean", and it doesn't work. So in this case, what is the minimal quantity of information needed to have
48:05
something working? I don't know. Saying "make me happy" is not sufficient, I think. Yes, it's the end of the day, but I had predicted that.
48:31
[Question:] One small question, maybe naive: in all of these problems where you ask for exact precision, say "are these two numbers equal?", which is even simpler than the maximum
48:44
problem, there is no hope in any case that a gradient descent algorithm could converge to something requiring infinite numerical precision. How does this relate to what you actually say in your talk?
49:07
[Answer:] No, what I said is that if I get an approximation of the algorithm... [Interjection:] But an approximation of equality doesn't exist... [Answer:] No, no: if I get an approximation of the maximum and it helps to find the correct neural network, then I'm happy.
49:30
If I have something that gives me the maximum with an error of, I don't know, five percent, that is already more than what you get with the present deep learning systems. I mean, if...
49:50
Take this example: I assume that it is impossible for the machine to find the convolution algorithm, the algorithm that computes the convolution, but I know that I need it to be able to
50:04
recognize a cat, because the recognizer has been designed around a convolution. If I train my neural network on pictures of cats without this convolution
50:20
inside, it won't work, it won't converge; it's a bet, because I never tried. If you have something which is not exactly the convolution but very close to it, the network will find the cat in the end, because
50:41
you just add the error of your imitation of the convolution to the error you would have had with the convolution already built into the weights of the neural network. A convolution is itself a neural network: basically you force some coefficients to be zero and you force other coefficients to be identical under translation;
51:06
that is a convolution. If you force that structure in, it works well.
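A small illustration of that "forced structure" (my own sketch, not the speaker's code): a 1D convolution layer is just a fully connected layer whose weight matrix is constrained to be banded, with the same kernel coefficients repeated along the diagonals.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 12, 3                       # signal length, kernel length
x = rng.normal(size=n)
k = rng.normal(size=m)             # the shared coefficients of the "convolution layer"

# Dense weight matrix with the convolutional constraints imposed by hand:
# zeros everywhere except a band, and the same kernel repeated on every row.
W = np.zeros((n - m + 1, n))
for i in range(n - m + 1):
    W[i, i:i + m] = k

out_constrained = W @ x
out_reference = np.correlate(x, k, mode="valid")     # the usual (cross-)correlation form
print(np.allclose(out_constrained, out_reference))   # True: same operation

# A generic dense layer would have to *learn* all these zeros and tied values;
# the talk's point is that gradient descent does not discover this structure by itself.
```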
51:21
If your system were able to discover by itself that some coefficients have to be zero and the non-zero coefficients have to be identical under translation, then it would be able to find that there is a cat or a dog in the picture; you would just add the error of your fake convolution. This is just to say that you need to prepare your data and your neural network
51:45
in order to get good convergence, because you have to know from the very beginning that you need a convolution. The person who discovered convolution was very smart (and maybe many people found it), but it is something a human brought;
52:03
my bet is that the machine cannot discover it. [Question:] What happens if you try to learn the max with a network with many hidden units, so not just four but, you know, hundreds? Because the usual...
52:20
[Answer:] Yeah, I see; in fact I tried it and it did not work, though maybe my neural network was not good. What does work is to use a recurrent neural network, which is another story: it means that you re-inject the data.
52:45
[Question:] That may be surprising, because people usually say that if the number of units is less than or comparable to the amount of data, then there can be lots of local minima, but if you take many more units you don't have these bad local minima.
53:00
[Answer:] If you add many layers, you increase the number of... [Question:] Not more layers; it could still be two layers, but many units in that second layer. [Answer:] Many units, you mean that you increase the dimension of the container? [Question:] Yeah.
53:20
[Answer:] Offhand, I don't have a definite answer for that. In fact I have no definite answer (don't ask any further questions, by the way), but I would say no: you just increase the number of possibilities. [Question:] Let's ask about the limit where people would say that it should work, I think.
53:43
[Answer:] No, in this case you will go again to the centroid, but faster. And if you reach the centroid for the maximum, it's not good. Now, be careful: if you try to find the maximum of positive numbers, then it converges; it is with signed numbers that there is a problem.
54:05
But my guess for the question is that if you increase the dimension of the system, it will just... I feel it will not converge.
54:20
You add more unknowns to determine. [Question:] Again, you would require infinite numerical precision, because I could always give you two numbers like 7.0000001... and ask how closely the function separates them; the thing is, you can never get this exactly:
54:42
for any implementation with a fixed number of neurons, I can always find two numbers which are so close together... [Answer:] No, no, this is not a question of numerical accuracy. My problem is not numerical accuracy. My claim is that there is no neural network that you can train with gradient descent that solves the max problem.
55:01
It just does not work out. If I ask for the maximum of 32 numbers and it gives me this error, I will say it is not a problem of numerical instability; it is a problem of not converging. [Question:] Are you using integers? [Answer:] Ah, no, it is on real numbers.
55:22
[Question:] So there is only a finite set of numbers. [Answer:] No, no, it is trained on random numbers; it is something very simple, but it has nothing to do with numerical instability. That is my bet. It just does not converge, it does not find the magic...
55:44
[Question:] It doesn't have to be exact; I mean, there is just the loss function, and how close does it get to the max? [Answer:] What does it mean to be close to the max? Well, you have the scale of the data, or a standard deviation, and the error should be much smaller than that. And you don't look at the accuracy of the weights,
56:00
you look at the accuracy of the answers of your neural network when you test it with real vectors, because you may have neural networks whose weights are very far apart but which give very similar answers when you test them. [Question:] What is the loss function for the max between two numbers?
56:21
[Answer:] The square of the difference between the max and what the network gives. [Question:] Yeah, but why is it that if you compare two numbers that are very small you don't get a small error, while if you compare two numbers which are very large but still close together... [Answer:] No, no, the numbers are between minus one and plus one. I did not do anything complicated.
56:42
I was not cheating like that. I could, but no: I choose random numbers, 32 numbers between minus one and one, real numbers, drawn uniformly. There is no trick here; I draw them uniformly. If I ask for the maximum of two numbers,
57:02
it converges to something that gives a correct answer, and if I were rich enough to let it converge for a hundred thousand years, it would go to arbitrary precision.
57:20
But I'm not asking for that here. I don't have those riches, by the way, but I did run a long convergence. And these points are stable, completely stable. You can wait,
57:41
you can multiply the number of iterations by thousands or by tens, and they are stable. This is the final result; they don't move anymore. Or rather they move because the learning rate is not zero, but it is the classic zigzag around a value,
58:04
nothing special. But maybe there is a bug; maybe what I thought was completely wrong. That's a good subject for research. Yes, it was strange for the maximum. Sorry, I am not trying to convince you,
58:25
because I am not fully convinced myself, by the way. [Question:] Right, it's the same as asking whether two numbers are equal. [Answer:] No, no, it is not the mathematical question of whether two numbers are equal. I have a function: I enter two numbers and it gives me an answer.
58:42
And the answer: I enter 0.45 and minus 0.03 and it gives me 0.1. I consider that answer not acceptable. If it gave me 0.40 or 0.44 I would be very happy, but the error is really big.
59:04
Really big, and it does not diminish if you select another initialization. It's not like the usual case
59:20
where, whatever initialization you take, it goes to a good local minimum. Here you take a new initialization and it goes to another minimum, which is not good either. [Question:] I'll take this off... [Answer:] No, no, no, that is a different problem, let's finish. No, this is because I did not explain it well, sorry.
59:41
But I think that you are tired; me, I could continue with the problem. So, someone wants to see the last slide? Oh, the last slide. I will show the last slide just to close. I was not thinking it's a...
01:00:00
shame. But again, I gave the hint, by the way. It is very simple: the monks just use the axiom of choice. And I gave a slightly wrong hint, because I said that you have the whole sequence of weather and predictions; in fact you only need the whole
01:00:29
sequence of past weather, together with the axiom of choice. Each monk can reason: I have a sequence, and I can fill in the sequence up to doomsday. I assume that
01:00:41
doomsday is tomorrow, and I just declare that all the weather from now until doomsday will be bad weather, so I put zeros. And then... I am not allowed to show you the future, because of God, apparently.
01:01:03
So they use the axiom of choice: all these sequences have one representative, given by the axiom of choice, and the monks predict according to this representative. The relation between
01:01:20
two sequences: two sequences are related if they differ in only a finite number of points. Therefore the monks all pick the same representative, and this representative is in the same equivalence class as the true sequence of weather. Therefore the prediction fails only a finite number of times, and God
01:01:44
will be happy. Yes, but the trick is that we assume the monks are able to manipulate infinite sequences.
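A compact way to state the strategy just described (my formalization of the argument, with days indexed so that doomsday is day 0 and the past is infinite): define on weather sequences

$$
x \sim y \iff \#\{\,k : x_k \neq y_k\,\} < \infty,
$$

and by the axiom of choice fix a representative r([x]) in every equivalence class of binary sequences. On day d a monk knows the true weather w_k for all k < d, pads the finitely many remaining days with 0, obtains a sequence in the same class as w, and predicts r([w])_d. Since r([w]) ~ w, the predictions differ from the truth on only finitely many days.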
01:02:04
It was a toy example to show that information theory alone is not a complete theory: you also need to take into account the computational abilities of whoever processes the data. [From the audience:] She says it was a toy model, with Fabien. [Answer:] Exactly. Don't ask me questions about this.
01:02:21
Okay, that's it... Sorry, Fabien, for running so long.