Beyond the statistical perspective on deep learning, the toposic point of view: Invariance and semantic information

Abstract
The last decade has witnessed an experimental revolution in data science and machine learning, essentially based on two ingredients: representation (or feature learning) and backpropagation. Moreover the analysis of the behavior of deep learning is essentially done through the prism of probabilities. As long as artificial neural networks only capture statistical correlations between data and the tasks/questions that have to be performed/answered, this analysis may be enough. Unfortunately, when we aim at designing neural networks that behave more like animal brains or even humans’ ones, statistics is not enough and we need to perform another type of analysis. By introducing languages and theories in this framework, we will show that the problem of learning is, first, a problem of adequacy between data and the theories that are expressed. This adequacy will be rephrased in terms of toposes. We will unveil the relation between the so-called “generalization” and a stack that models this adequacy between data and the tasks. Finally a five level perspective of learning with neural networks will be given that is based on the architecture (base site), a presemantic (fibration), languages, theories and the notion of semantic information (joint work with Daniel Bennequin).
Transcript: English (auto-generated)
Thank you very much, Laurent, for the kind introduction. So, as Laurent said, this is the title of my presentation: Beyond the Statistical Perspective on Deep Learning, which is in fact the usual perspective nowadays, the Toposic Point of View: Invariance and Semantic Information. I just wanted to mention that there is an arXiv paper, a joint work with Daniel Bennequin, which has been available since last night on arXiv. So yes, it is effectively joint work with Daniel Bennequin, and I would also like to thank Laurent Lafforgue, who first opened this fantastic world of toposes to me in 2017, and Olivia Caramello, who has helped us a lot to understand some notions
that I will introduce afterwards. So, just a brief introduction on AI and machine learning for our mathematician colleagues. I should say that I am not originally a mathematician; I come from the electrical engineering world, which in general is quite far away from toposes, and it is through AI, artificial intelligence, and machine learning that I have been confronted with toposes and other related mathematical notions. So what is machine learning? Very briefly, it can be subdivided into three basic tasks.
The first one is supervised learning, which in general is used to perform classification or regression. That means, for example, that you want to recognize whether there is a cat in an image or not. You use what are called labeled input data, that is, images that someone has labeled as cat or non-cat, and these are used to train the machine. So you train the model using these labeled input data, and finally you test it with new images, hoping that it will be able to recognize a cat even in images that were not part of the training data. Regression is, let's say,
related to more continuous problems. Then there is also unsupervised learning, where you don't have any labeled data, only raw data, and the idea is to perform clustering, dimension reduction or discrimination. You don't know that what you are considering is a cat, because this concept is completely unknown to you, and what you want is to discriminate, for example, cats from non-cats based on the fact that, say, cats lie on some manifold and non-cats lie outside this manifold; you want to understand patterns and discover the outputs. And then there is another task, maybe very close to the way animals or humans behave, which is reinforcement learning, where an agent interacts with its environment by performing actions and learning from errors and rewards. So it is a trial-and-error method. In some sense, reinforcement learning can be considered as a kind of supervised learning, since when you get rewards or errors there is something that tells you it is an error or gives you a reward. That's why it is quite close to supervised learning. In this presentation,
we will essentially consider the supervised learning case, because it is maybe the simplest one to understand for our purposes, but the other ones can be understood in the same way. So, in order to perform these machine learning tasks, the most popular and most successful approach is to use neural networks. So what is a neural network? This is an example of what is called a fully connected deep neural network. This is the simplest case: you have input data here at the input,
of this neural network, and you have layers: an input layer, a first hidden layer, a second hidden layer, etc., and at the output what is called the output layer. I'm sorry, it's a little bit small, but the output of a neuron yj is computed as a linear combination of the inputs xi using what are called the weights wij, which will be learned from training data, plus what is called a bias; so instead of being linear, it is an affine transformation. Then you apply to this number a nonlinear function phi, called an activation function, which could be a sigmoid, the hyperbolic tangent, a rectified linear unit (the third possibility), or some other more exotic activation function.
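As a minimal sketch of this computation (not from the talk; the shapes and names are hypothetical), one layer of such a network can be written as follows:

```python
import numpy as np

def layer_forward(x, W, b, phi=np.tanh):
    """One fully connected layer: affine map followed by an activation.

    x   : (n_in,) input activities
    W   : (n_out, n_in) weight matrix, learned from training data
    b   : (n_out,) biases
    phi : nonlinear activation (sigmoid, tanh, ReLU, ...)
    """
    return phi(W @ x + b)

# Toy usage with random weights (before any training).
rng = np.random.default_rng(0)
x = rng.normal(size=4)          # input layer with 4 neurons
W = rng.normal(size=(3, 4))     # 3 neurons in the next layer
b = np.zeros(3)
y = layer_forward(x, W, b)      # y_j = phi(sum_i w_ij * x_i + b_j)
```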
Depending on the problem, you choose the activation that is best suited. So, in order to train a neural network, that is, in order to compute the weights and the biases, the most popular algorithm is called backpropagation. It dates back to the 1980s. What you do is compute a loss function, which can be based on the Kullback-Leibler divergence, cross-entropy, mutual information, or other losses, some of them simply based on Euclidean distance. The variables of this loss function are the weights and biases of the neural network, and the idea is to find the minimum of this function using the labeled training data. This is done by gradient descent, and thanks to the chain rule for computing the partial derivatives, the gradient calculus becomes very efficient and can be done layer by layer.
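A minimal sketch of gradient descent with layer-by-layer backpropagation (a hypothetical two-layer network on toy data with a squared-error loss, not the specific setup of the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy labeled data: learn y = sin(x) from a few samples.
X = rng.uniform(-2, 2, size=(64, 1))
Y = np.sin(X)

# Two-layer network: tanh hidden layer, linear output.
W1, b1 = rng.normal(scale=0.5, size=(8, 1)), np.zeros((8, 1))
W2, b2 = rng.normal(scale=0.5, size=(1, 8)), np.zeros((1, 1))
lr = 0.1

for step in range(2000):
    # Forward pass.
    H = np.tanh(W1 @ X.T + b1)          # hidden activities
    Yhat = W2 @ H + b2                  # network output
    E = Yhat - Y.T                      # error
    loss = np.mean(E ** 2)              # (could be printed to monitor training)

    # Backward pass: chain rule, applied layer by layer.
    dYhat = 2 * E / X.shape[0]
    dW2, db2 = dYhat @ H.T, dYhat.sum(axis=1, keepdims=True)
    dH = (W2.T @ dYhat) * (1 - H ** 2)  # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = dH @ X, dH.sum(axis=1, keepdims=True)

    # Gradient descent update of weights and biases.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```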
This figure gives a schematic view of backpropagation; I don't want to spend too much time on that. So you have seen a neural network based on fully connected layers, but you can have other architectures when you deal with very specific problems, for example what are called convolutional neural networks. I will come back to this architecture later on, because it is really related to a word in my title, invariance. The idea is that, instead of considering the full family of linear transformations when you go from one layer to the next, you restrict to some more specific linear transformations, in this case convolutions. So you have some layers which are convolutions, followed by max pooling, which is a kind of reduction, and at the end you have fully connected layers. I will explain later why we use this architecture. Convolutional neural networks are basically used for computer vision tasks, so essentially for image processing.
Then you have recurrent neural networks, where the inputs are vectors and the idea is to use a kind of loop; if you unfold this loop, you obtain this kind of architecture. This architecture can be used when you have time series, or for example natural language processing, where sentences can finally be considered as a kind of time series. But basic recurrent neural networks are not good: they cannot be trained efficiently, because with gradient descent the gradients rapidly vanish, which means the loss function will not become very low and you will have too many errors. So the idea is to use other kinds of cells in this recurrent setting, which are called long short-term memory cells, LSTM cells. I will not spend too much time here either; as you can see, the idea is to have not only short-term memories but also long-term memories, so that the neural network can be trained more efficiently.
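As a rough illustration of the vanishing-gradient problem mentioned here (a toy computation, not from the talk): in a plain RNN the gradient through T time steps contains a product of T Jacobians, each damped by the derivative of tanh, so with small recurrent weights its norm typically decays exponentially (with large weights the opposite, exploding, behavior can occur instead).

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 16, 50
W = rng.normal(scale=0.5 / np.sqrt(d), size=(d, d))  # small recurrent weights
h = rng.normal(size=d)

grad = np.eye(d)              # accumulates d h_T / d h_0 step by step
norms = []
for t in range(T):
    h = np.tanh(W @ h)
    J = (1 - h ** 2)[:, None] * W   # Jacobian of one step: diag(tanh') @ W
    grad = J @ grad
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])   # the norm shrinks by many orders of magnitude
```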
Okay, so after this very brief introduction on machine learning and neural networks,
let's go to a toposic view of deep neural networks. Very briefly, it is possible first to model a neural network by using Grothendieck toposes. It can be done in several steps. The first one is based on the architecture of the DNN, the deep neural network, and it will constitute the base Grothendieck site. Our way of considering it, in the case of a chain for example, which is what you have seen with the fully connected deep neural network, is the following: we have shown that the best way of modelling it with toposes is to consider that each layer is an object of some site. Here, the feed-forward functioning of the network, once the network has been trained, will correspond to a covariant functor X from the category generated by the graph, by this graph, to the category of sets. You can see that it already smells like a Grothendieck topos, of course. This X^w_{k+1,k}, which is a mapping from X_k to X_{k+1}, so an edge here, will correspond to the learned weights between layer k and layer k+1, and it corresponds to an arrow in this category C^op(Γ). Gamma is, of course, the graph of the neural network.
Then the weights are encoded in another covariant functor, the blackboard W, from the category C^op(Γ) to the category of sets. The idea is that at each layer L_k we define W_k as the product of all the sets of weights W_{l+1,l} (essentially the weight matrices between consecutive layers), and to the edge that goes from layer L_k to layer L_{k+1} we associate the natural forgetting projection from W_k to W_{k+1}. The Cartesian product X_k × W_k, together with this map, also defines a covariant functor, the blackboard X, and the natural projection from X to W is a natural transformation of functors. What is interesting is that if you consider supervised learning, which is the central case here, then the backpropagation algorithm can be represented by a flow of natural transformations of the functor W to itself. In this setting, over the category C(Γ), the functors X, W and the blackboard X become contravariant functors to the category of sets; that means they are presheaves over C, and therefore they are objects of the presheaf topos Ĉ.
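To keep track of the objects just introduced, here is a schematic summary (my reconstruction of the notation, following the talk; the indexing convention for the product is an assumption):

```latex
% Chain architecture L_0 \to L_1 \to \dots, category C^{op}(\Gamma) generated by the graph
X \colon C^{op}(\Gamma) \to \mathbf{Set}, \qquad
X^{w}_{k+1,k} \colon X_k \to X_{k+1} \quad \text{(feed-forward map for learned weights } w\text{)}

\mathbb{W}(L_k) = W_k = \prod_{l \geq k} W_{l+1,l}, \qquad
\pi_k \colon W_k \to W_{k+1} \quad \text{(forget the first factor)}

\mathbb{X}(L_k) = X_k \times W_k, \qquad
p \colon \mathbb{X} \to \mathbb{W} \quad \text{a natural transformation;}

\text{backpropagation} = \text{a flow of natural transformations } \mathbb{W} \to \mathbb{W}
\quad \text{in the presheaf topos } \widehat{C}.
```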
This is the case of a chain, which is quite simple: in this setting we have objects and natural transformations in the topos of presheaves on this simple site. Now, if you have something a little bit different from a chain, that is, if we consider the general case, then the situation becomes a little trickier, and the functioning and the weights can no longer be defined by functors on C(Γ).
So what we have done is a canonical modification of this category. For example, if you have this kind of problem to solve, where in the graph several different modules converge to this object a, a small a, then we have to perform a surgery, because considering this graph as a site directly will not work at all. The idea is to introduce new objects between all these a′, a″, etc., and a, namely a capital A* and a capital A, with arrows that go from A* to A and from the small a to the capital A. They form a fork with tips at a′, a″, etc., whose handle is formed by A*, A and the small a. If we reverse the arrows, we get a new oriented graph without oriented cycles, and the new category
that replaces C(Γ) will be the category C(bold Γ), which is opposite to the category freely generated by this bold Γ. Now the main structural part, namely the projection from a product, the product over a′, a″, etc., to its components, can be interpreted by the fact that this presheaf becomes a sheaf for a natural Grothendieck topology J: on every object X of this new category the only covering is the full slice category C/X, except when X is of the type A*, in which case we have the covering made by the arrows of the type a′ → A*, a″ → A*, et cetera.
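A schematic picture of this surgery, in the notation of the talk (my reconstruction; the arrow orientations are as I understood them):

```latex
% Several modules a', a'', ... converging onto a vertex a: insert A^* and A
a' \longrightarrow A^*, \quad a'' \longrightarrow A^*, \;\dots \qquad
A^* \longrightarrow A, \qquad a \longrightarrow A
% (a fork with tips a', a'', ... and handle A^* \to A \leftarrow a)

% Grothendieck topology J on C(\boldsymbol{\Gamma}):
J(X) = \{\ \text{the maximal sieve } C/X \ \} \quad \text{for every } X \neq A^*,
\qquad
J(A^*) \ni \text{the sieve generated by } \{\, a' \to A^*,\; a'' \to A^*, \dots \,\}.
```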
In this way we cover basically all the structural scenarios that happen in neural networks, even modular neural networks, where you connect many neural networks to other ones, et cetera. So this is the structure, but the structure is not enough. Now we have to consider a second stage, which is what we call a pre-semantics. Here we will see that considering just a Grothendieck topos is not enough to characterize all the neural networks that can be used today, and maybe to characterize some new ones that may emerge in the future. Let's start with a simple example, the example of convolutional neural networks. This is the one that I showed you
in a preceding slide. This convolutional neural network is used for image processing, and images are of course assumed to be by nature invariant under plane translations: if you have an object in an image and you shift it, this object is still the same object. The idea is to use this invariance in order to learn much more efficiently. If you are able to take this translation invariance into account, then you have far fewer weights to learn, and you also need much less training data to train the neural network. In this case, this imposes on a large number of layers to accept a non-trivial action of the group G of 2D translations, and on a large number of connections between two layers to be compatible with the action of this group. That means that even the underlying linear part, when it exists, will be made of convolutions with a numerical function on the plane. This is the way the action of the group G of 2D translations is taken into account. Of course, it does not forbid that in several layers, for example the last ones, the action of G is trivial, in order to get characteristics that are invariant under translations; in that case those layers can be fully connected.
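A small numerical check of the equivariance property underlying this discussion (a sketch, not from the talk; with periodic boundary conditions the convolution commutes exactly with 2D translations):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(3)
image = rng.normal(size=(32, 32))     # toy "image"
kernel = rng.normal(size=(5, 5))      # toy convolution kernel (a learned filter)
shift = (4, 7)                        # a 2D translation

def conv(x, k):
    # 'wrap' = periodic boundary, so the translation group acts exactly
    return convolve2d(x, k, mode="same", boundary="wrap")

shifted_then_conv = conv(np.roll(image, shift, axis=(0, 1)), kernel)
conv_then_shifted = np.roll(conv(image, kernel), shift, axis=(0, 1))

# Equivariance: convolving the shifted image = shifting the convolved image.
assert np.allclose(shifted_then_conv, conv_then_shifted)
```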
Some other groups have also been considered in the literature, together with their convolutions. DNNs that analyze images are now almost always constructed in the same way: several channels of convolution maps, with max pooling, all joined at the end to a fully connected DNN in order to take the decision. This looks like a structure designed to localize the translational invariance, and it is in fact what happens in the visual areas of animal brains, so it is really a copy of nature. What is also interesting is that experiments show that in the first layers, wavelet-like kernels form spontaneously to handle contrast and colour, and opposition kernels form to construct colour invariance. So these convolutional neural
networks are a very, very interesting tool for image processing. OK, so let's go back to our toposic interpretation. As we have seen,
we need to take into account this group invariance. The toposic manner to encode this situation consists in considering the contravariant functors from the category C of the network, the one we have seen that takes into account the structure of the network, with values in the topos of G-sets; indeed, the actions of this group G on sets are exactly the objects of the topos of G-sets. The collection of these functors, with their morphisms, forms a category which was shown to be itself a topos by Giraud in 1972, and we thank Olivia Caramello for having informed us of this fantastic work of Giraud. It is equivalent to introduce a category F, fibered in groups isomorphic to G over the category C, and it satisfies the axioms of a stack. F in this case has a canonical topology J, the coarsest one such that π, the morphism from F to C, is continuous, and the ordinary topos E of sheaves of sets over this site (F, J) is named the classifying topos of the stack; it is naturally equivalent to the topos C̃_J that we have seen here. But the theorem of Giraud is much more general than that: it does not concern only groups, it extends to any stack over C, and it says that the category of covariant functors from C to the topos of the fibers is equivalent to the classifying topos of the stack. Nothing is seriously changed compared to groups if the group is replaced by a groupoid and we consider a category F fibered in groupoids over the category C, or its associated stack. For our own purposes, we have also considered posets, and posets fibered in groupoids, instead of groupoids.
It is something that I think Daniel will introduce, but it will not be part of my talk. So with groups we cover convolutional neural networks; with groupoids, what is interesting is that we open up the interpretation of the long short-term memory cell RNNs, or of what are called attention networks, which are very powerful networks. So it is a generalization which is very, very interesting
for us. Then we have the language. We now have to consider a fibration, another fibration over F, denoted A in this case, which chooses an adapted language and a semantics over every object of the architecture, augmented by a context in its internal category F. The objects U of the architectural category C, together with a ψ of the fiber F_U, represent the pre-semantic contexts in the layer represented by U, and each of them possesses a reservoir of logic in the classifier of parts Ω_{U,ψ}. The transmission of the potential logics between layers and contexts, for a morphism (α, h), goes in both directions. Daniel will explain these logical functors in more detail: the covariant functor π_* and the contravariant one π^*, which come respectively from the right adjoint F^*_α and the left adjoint F_α, and which extend the unit extensions of the pullback defined by the functor F_α; you will have a more detailed explanation in the next talk. They give rules of transformation of the formulas or axioms available at one layer into formulas or axioms at another, connected layer, and this can go backward or forward depending on which π we consider: π^* is a kind of projection, while π_* is a section of π^*. One goes towards the output theories, while the other enriches them by considering other possibilities; this is something that Daniel will explain later on. Now, just before considering this concept of information, briefly,
I would like to show you the results of some basic experiments, and I would like to thank Xavier Giraud a lot for that; he performed all these experiments. The first ones were done using small networks. We were inspired by a result of two Greek neuroscientists, who analyzed what happens downstream of what are called the motor equivalent cells, the MEC neurons. They found that the neurons coming afterwards were in fact performing Boolean propositional calculus, and we wanted to see exactly what happens if we replace those neurons by artificial neural networks. So we modelled the
output of the MEC cells by an activation signal distributed according to a von Mises probability distribution. The idea is that we have an activator A that can take three values: capital E, the eye, corresponding to an activation of the eye of the monkey (those experiments were done on monkeys); capital H, the hand; and EH, both eye and hand. We used very small neural networks. The first experiments were done with three layers, an input layer L0, an output layer L2 and just one hidden layer L1, where those numbers are just the numbers of neurons per layer, and then we also tried four and five layers in order to see what happens.
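A hypothetical sketch of this kind of toy experiment (my own reconstruction, not the authors' code; the class structure, preferred angles, sample sizes and network sizes are guesses):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
classes = ["E", "H", "EH"]               # eye, hand, eye-and-hand
mu = {"E": 0.0, "H": 2.1, "EH": 4.2}     # hypothetical preferred angles per activator

# Activation signals of the modelled MEC cells: angles drawn from von Mises laws.
n_per_class, n_cells = 300, 8
X, y = [], []
for label in classes:
    angles = rng.vonmises(mu[label], kappa=4.0, size=(n_per_class, n_cells))
    X.append(np.cos(angles))             # a crude "cell activity" read-out of the angle
    y += [label] * n_per_class
X = np.vstack(X)

# Small tanh network: one hidden layer, as in the first experiment of the talk.
net = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh", max_iter=2000)
net.fit(X, y)

# Inspect the signs of hidden-unit activations per class, looking for "logical cells".
H = np.tanh(X @ net.coefs_[0] + net.intercepts_[0])
for label in classes:
    print(label, np.sign(H[np.array(y) == label]).mean(axis=0).round(2))
```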
The activation function was the hyperbolic tangent. So, very quickly, here is what happens. On those circles you can see the activation of one neuron in some hidden layer; this is cell one, cell two, cell three and cell four. You can see,
for example, this is the way they are encoded. The blue curve represents, in fact, the response of the neuron when it's the eye. Okay? The red one, the response of the neuron when
it is the hand, and the green one when it is both eye and hand. Okay? So, well, and then when the curve is dashed, it means that, in fact, the sign of the output of the
hyperbolic tangent is minus one, and if it is not dashed, it is plus one. Okay? So as you can see here, when the curve is red, in fact, the response, the sign of the response depends on the angle. But when it is blue or green, it doesn't depend on the angle.
So we cannot deduce any logical behavior when the curve is red, but we can deduce a logical behavior when it is blue or green. If it is blue, so the eye, we can say that eye implies minus one, the sign minus one, and eye-and-hand also implies minus one. If we contrapose those two implications, we obtain that plus one implies the hand: if the output of this cell is positive, it means that the hand was the activation at the input. We can see the same, for example, for cell four here: only one curve does not change sign, the blue one, so eye implies, in this case, plus one, and if we contrapose this implication, it means that minus one implies hand or EH. So you can see that this neuron already performs Boolean propositional calculus.
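Written out, the little deduction used here is just contraposition in propositional logic (my rendering of the argument for the first cell):

```latex
% s = sign of the cell's output; the activator takes exactly one of the values E, H, EH
(E \Rightarrow s=-1)\ \wedge\ (EH \Rightarrow s=-1)
\;\equiv\;
(s=+1 \Rightarrow \neg E)\ \wedge\ (s=+1 \Rightarrow \neg EH)
\;\Longrightarrow\;
(s=+1 \Rightarrow H).
```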
With three hidden layers, the network generates complete triplets of cells; that means a triplet is then sufficient to conclude in every case, because such a triplet always has this kind of behavior: plus one implies E, plus one implies H, or plus one implies EH; minus one implies, et cetera. From the behavior of the three cells of the triplet we can conclude in every case, because when a configuration implies the false, it means that this configuration basically never happens, and for the other configurations we can always conclude without ambiguity. What is interesting in this experiment is that we have used two different encodings for E, H and EH at the output. In the first one, we encoded these three activations using just
an interval: E was at one end of the interval, EH at the other end, and H in the middle. The problem is that with this encoding it was very hard to make logical cells appear. With the second encoding, which respects, in some sense, the group of symmetries of the problem, because you can basically exchange E, H and EH, the logical cells appeared much more easily. This shows that the action of this group is very important; in this case it is the symmetric group. Then we also did some experiments where, instead of three classes, we consider more classes, and we looked at what we call the logical information ratio, which is the number of decidable logical propositions at each layer divided by the number of logical propositions that can be generated in the theory. You can see that when you go from the input layer to the output layer, this logical information ratio increases, and at the end you can basically decide everything in the theory.
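As a formula (my notation, not taken from the slide), the quantity tracked here is simply

```latex
\mathrm{LIR}(\ell) \;=\;
\frac{\#\{\text{logical propositions decidable from the activities of layer } \ell\}}
     {\#\{\text{logical propositions generated in the theory}\}},
\qquad \mathrm{LIR}(\ell) \nearrow 1 \ \text{towards the output layer.}
```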
We have also done some experiments based on predicate calculus. In this case we considered, very quickly, two or three bars: a red bar and a green bar, or also a blue bar in our experiments, living on an interval, so a line, or on a circle, which is a line modulo something. For two bars, the questions that were asked were: D, are they disjoint? II, is one bar included in the other one? Or IO, do they intersect while the shortest is not included in the longest? You can see that, compared to the input layer, where you just have, in some sense, regions of this interval or of the circle, the propositions involved here are predicates and do not just come from propositional calculus.
With three bars, the same questions, but of course with more possibilities. These are the first results. If we train with bars of respective lengths five units and three units for R and G, and if we test with the same lengths, then, let's rather use this figure: here a bar is simply encoded by the center of the bar and its length. In this case, the testing looks almost perfect, basically because the testing is done with the same lengths; we don't ask the neural network to generalize in any sense. But if we ask it to generalize, for example by still training with bars of lengths five and three, and testing with lengths four and six, and also exchanging the bars, so that for example the longest one in training becomes the shortest one in testing, then, as you can see, the results are not so bad, but they are quite blurry here, here, here and here. So it is not bad, but of course a little bit
worse than in the preceding case. What is also interesting is that with fewer than three layers, the neural network does not use logic to perform the task but rather a kind of Fourier analysis, whereas from three layers on, it is a logical analysis that is performed.
What is interesting is that disjointness and inclusion are the most frequent outputs, and in order to decide the intersection-only case IO, most neurons, instead of being trained to decide it directly, eliminate the two other possibilities, D and II, probably because IO is more difficult from the point of view of predicate calculus than the two other possibilities. But this is just a hypothesis. If we use enriched training, then we get a remarkable logical behavior, where just by using the outputs of two neurons we can basically answer the questions that are asked. Of course, it requires quite a high generalization power from the neural network, but it is not bad at all.
There is also a very nice relation between the last layers' weights and the logic: at the last layers, it is as if the weights were performing the proof. For example, if you consider the histogram of the deductive power of the weights applied to quantized activities, over all possible triplets of neurons in a given layer (here with six objective functions, so three colours), you can see that the weights are distributed basically almost everywhere; but if you select only the interesting triplets, then the weights have a much narrower distribution, and this is really because in this case they basically perform the proof from, for example, the last hidden layer to the output layer.
Also, we had... sorry, maybe I will have to skip this, because I do not have much time. So now let's go to part four, which is related to the notion of semantic information. If we want to define what semantic information is, we need to understand how semantics appears when, for example, you use a neural network. First of all, the semantic category that we will consider is a quite general category. Artificial intelligence is connected to the real world, the one we are perceiving, and so languages have to be as general as possible; they cannot be just the languages currently used, for example, in mathematics, they have to be richer than that. As has been suggested by Lambek and Scott, a good caricature of semantics is the interpretation of the language in a complete category. A topos is of course a good example, but here we aim at being more general, and what we propose is a
biclosed monoidal category. It is a monoidal category such that for any triple of objects X, Y and A there exist two exponential objects, one on the left and one on the right, and natural bijections such that the corresponding equivalences are satisfied.
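In symbols, the two exponentials can be written as follows (a standard presentation of a biclosed monoidal category; the notation is mine, not necessarily that of the slide):

```latex
% Two internal homs (exponentials), adjoint to the tensor on each side:
\mathrm{Hom}(X \otimes Y,\ A) \;\cong\; \mathrm{Hom}\!\bigl(X,\ {}^{Y}\!A\bigr)
\qquad \text{(conditioning of } A \text{ by the post-supposition of } Y\text{)},

\mathrm{Hom}(X \otimes Y,\ A) \;\cong\; \mathrm{Hom}\!\bigl(Y,\ A^{X}\bigr)
\qquad \text{(conditioning of } A \text{ by the presupposition of } X\text{)}.
% When the two exponentials commute, we recover the classical closed case, e.g. a topos.
```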
What does it mean in terms of language? It means that if, for example, X, Y, etc., are the meanings of something, then the tensor product is the composition, the exponents on the left and on the right are respectively the conditioning of A by the presupposition of X or by the post-supposition of Y, and the arrows in this category are associations of meaning, evocations, etc. Of course, if those two exponentials commute, we recover the classical case of a topos. In this setting, theories are collections of objects and arrows such that if, say, A belongs to the theory T and an arrow from A to B is in T, then B has to be in T. The two actions of this monoid A, one on the right and one on the left, given by the exponentials, are named conditionings, and these conditionings will be essential to define the notion of semantic information, which will be defined in more detail by Daniel in the next talk. So let's see now what we call data sets. In fact, we will see that
in machine learning they are not really just data sets, they are much more than that. To see this, let's consider the case of supervised learning. We have input data x_in, which are elements of a big set X of possible data; they are basically all the possible data that can be seen at some point by the neural network. Then at the output you have theories T_out, belonging to a set of theories, capital Theta. In the classical setting of machine learning, a neural network is seen as a parameterized set of functions f_w, parameterized by the weights w, from X to Theta, which associate to any data x_in a theory T_out. For example, you have data and you have to say whether
the proposition "it is a cat" is true or false, to take this very simple example. There is a universal approximation theorem by Cybenko, from 1989, which says that continuous maps from a compact subset K of a numerical space R^d, which here contains the input data, to another numerical space can be approximated uniformly on any compact subset by a standard neural map with a fixed nonlinearity of sigmoid type (the sigmoid being an example of activation function). Basically, this shows that the neural network works well for interpolation. But it has to be a compact subset; so what happens outside the compact subset?
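Stated a bit more formally (my paraphrase of the classical result cited here):

```latex
% Cybenko (1989): universal approximation with a sigmoidal nonlinearity \sigma
\forall\, f \in C(K,\mathbb{R}),\ K \subset \mathbb{R}^d \ \text{compact},\ \forall \varepsilon > 0,\
\exists\, N,\ \alpha_j, b_j \in \mathbb{R},\ w_j \in \mathbb{R}^d :
\sup_{x \in K}\ \Bigl|\, f(x) \;-\; \sum_{j=1}^{N} \alpha_j\, \sigma\!\bigl(w_j^{\top} x + b_j\bigr) \Bigr| \;<\; \varepsilon .
```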
The problem is that, in theory, even with low probability, some completely new input can happen, and in that case you have absolutely no guarantee that you will find the right theory corresponding to these data. So what about extrapolation? It is related to what is called generalization in machine learning, which is the possibility for the neural network to extrapolate. But for that, a theorem of analysis alone is not enough at all. We need something else: we need to capture the essence, the structure of the data, with respect to the goal, which is related to the task or the question asked to the network. What we want is that a small set of data, the one used for training the network, can be considered as representative for the learning problem; that is, training on this data set alone is sufficient to know, basically, what will happen over the whole set of possible data. And in this case, the approach of deep learning is
what I presented to you in the second part: construct an architecture, expressed using the site of the architecture of the neural network, a stack, considered by means of fibrations in groups, groupoids or more general categories, and a language, which will be a fibration over this stack of layers of neurons, able to extract the structure from a minimal training sample. The relations between data and theories then go through properties, by which we mean invariance under the action of a group, or of something more general, groupoids or more general categories. So we just introduced the action of something much more general than the action of a group on a set or of a groupoid on a set,
namely the action of a category G on another category V. The category G acts on the category V when a contravariant functor from G to V is given. Because, for example, when a group acts on a set we need to consider elements, we need to define elements in the category V: an element of an object V is just a morphism φ from U to V. Now the definition of the action: suppose that G, the first category, acts through this functor F from G to V, and that V = F(A). Then the orbit of φ under the slice category G/A is the functor from this slice category to the corresponding slice category in V which associates to any morphism the corresponding element of F(A′) in V, and to any arrow from A″ to A′ over A the corresponding morphism from U → F(A′) to U → F(A″).
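In compact form, the notions just introduced read roughly as follows (my notation; the variance conventions follow the talk as I understood them):

```latex
% Action of a category G on a category V: a contravariant functor
F \colon G^{\mathrm{op}} \longrightarrow V .

% Element of an object V of V: a morphism
\varphi \colon U \longrightarrow V \qquad (\text{generalized element of } V).

% Orbit of \varphi under G/A, when V = F(A): the functor
\mathcal{O}_{\varphi} \colon G/A \longrightarrow U\backslash V, \qquad
(u \colon A' \to A) \;\longmapsto\; \bigl( F(u) \circ \varphi \colon U \to F(A') \bigr).
```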
In this case, this theory of toposes, stacks and languages extends the notion of actions of categories and their morphisms to the action of a fibered category F on a fibered category B, and from group equivariance, which is represented in the structural properties of CNNs for example, we go to category equivariance. In particular, given a sheaf of categories from C to Cat, for example a stack in groupoids or in some other categories, that we consider as a structure of invariance, and another sheaf that we consider as a structure of information flow, for example possible theories or the information spaces that Daniel will define afterwards, then, given an object Q of C, an action of F on M is a family of functions such that we have this nice commutation relation. This is a vast generalization of group equivariance, and it will allow us to consider much more general structures on neural networks, in order to take into account many structural aspects and to generalize much better than what is currently done.
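The commutation relation itself is not written out in the transcript; a plausible schematic form (my guess at the intended condition, in my own notation) is an equivariance condition of the following type:

```latex
% For each object Q of C and each arrow u : Q' \to Q,
\rho_Q \colon F(Q) \times M(Q) \longrightarrow M(Q),
\qquad
M(u)\bigl(\rho_Q(g,\, m)\bigr) \;=\; \rho_{Q'}\bigl(F(u)(g),\ M(u)(m)\bigr),
% i.e. acting then restricting = restricting then acting (generalizing G-equivariance).
```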
Sorry, it is getting a little late, so I have to be very quick. We have made a hypothesis of invariance enlargement: between the inputs and the outputs there exists a kind of layer, which we call a maximal invariance layer, that contains basically the full possibilities of invariance of the problem. In this setting, the correspondence from the inputs to the output theories in Θ is said to be justified if there is a language, external or coming from some supervisor, wider than the language of the output, the language of the question, and of course broader than the language of the input, which is basically just a very simple theory with many, many objects, for example the pixels of the image, but basically no morphisms; these are the languages respectively adapted to the questions of the output and to the encoding of the input. The correspondence will then factorize by this language L_X through a collection of expressions of the following type: aspects a, b, c of the input, expressed by sentences in the language L_X, will characterize the proposition in the language of the outputs.
a kind of theorem of semantic coding which is this exact category, so where you have two semantic
sorry, where you have, sorry, fibered categories at the input at the output, two semantic sheaves of language respectively at the input and output given by the fibered category, and now we will assume that on f this central category which in fact
will be the place where you will have the maximum invariance. We will have this language a and that can provide the justification from for the mapping
sigma from the input to the output, right? And then to prove that every justified problem can be, oh, sorry, realized, oops.
By triple C, F, and A, we must realize this F and A by a stack and languages over a site C that is given by a neural network architecture. Okay, and the invariance under F will be isomorphic to the maximum enlargement of the stack. So this is for now
just a hypothesis, a kind of conjecture that we hope to be true. Now, here is the first notion of semantic information, introduced by Carnap and Bar-Hillel in 1952. What they had were elementary propositions. In this example, they had three subjects A, B, C; these subjects were individuals, persons, that have two different attributes: M for male or F for female, and Y for young or O for old. Elementary propositions were then statements saying, for example, that A is young and male, B is young and female, and C is old and male. The combinatorics in this case is given by all possible parts of the set of those elementary propositions.
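For orientation, a quick count of the combinatorics in this toy example (my arithmetic, not stated explicitly in the talk):

```latex
% 3 subjects, each with one value of sex (M/F) and one of age (Y/O):
\#\{\text{state descriptions}\} = (2 \times 2)^{3} = 4^{3} = 64,
\qquad
\#\{\text{propositions}\} = \#\,\mathcal{P}(\text{state descriptions}) = 2^{64}.
```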
But what is interesting is what information could be in this case, by considering some shapes in some spaces; I will show in which way.
For example, we can see that there exists a Galois group G of the language, generated by the permutations of the N subjects, here three, the permutations of the values of each attribute, and the permutations of the attributes that have the same number of possible values. In this case, the group of subject permutations is the symmetric group S3. The transposition of values is sigma_A, exchanging the two age values, and similarly for the gender transposition sigma_G; then you have the exchanges of attributes, defined by permutations such as tau. The group generated by sigma_A, sigma_G and tau is of order eight: it is simply the dihedral group D4 of all the isometries of the square with these vertices. The stabilizer of a vertex is a cyclic group C2, of type either sigma or tau, and the stabilizer of an edge is of type sigma_A or sigma_G, and is denoted this way. So in this case the Galois group of the language is G, the product of S3 and D4. That means that the language L here is a sheaf over the category G, which plays the role of the fiber F, while C has only one object U0 in this case, and there are four types of orbits.
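For the record, the size of this Galois group (my arithmetic, not stated in the talk):

```latex
|G| \;=\; |S_3| \times |D_4| \;=\; 3! \times 8 \;=\; 48 .
```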
And what is interesting is that all those types can generate a new proposition in this big language. For example, type one can be translated into "all the subjects have the same attributes", and similarly for the other types. This is why we now need to consider information measures, semantic information measures, that are not only scalar quantities, as is the case for Shannon information measures, but take values in a space. OK, so maybe I have to stop here, sorry. Thank you very much, Jean-Claude.