
7th HLF – Turing Lecture: Deep Learning for AI


Formal Metadata

Title: 7th HLF – Turing Lecture: Deep Learning for AI
Number of Parts: 24
License: No Open Access License. German copyright law applies. This film may be used for your own use, but it may not be distributed via the internet or passed on to external parties.

Content Metadata

Abstract
This lecture will look back at some of the principles behind the recent successes of deep learning as well as acknowledge current limitations, and finally propose research directions to build on top of this progress and towards human-level AI. Notions of distributed representations, the curse of dimensionality, and compositionality with neural networks will be discussed, along with the fairly recent advances changing neural networks from pattern recognition devices to systems that can process any data structure thanks to attention mechanisms, and that can imagine novel but plausible configurations of random variables through deep generative networks. At the same time, analyzing the mistakes made by these systems suggests that the dream of learning a hierarchy of representations which disentangle the underlying high-level concepts (of the kind we communicate with language) is far from achieved. This suggests new research directions for deep learning, in particular from the agent perspective, with grounded language learning, discovering causal variables and causal structure, and the ability to explore in an unsupervised way to understand the world and quickly adapt to changes in it. The opinions expressed in this video do not necessarily reflect the views of the Heidelberg Laureate Forum Foundation or any other person or associated institution involved in the making and distribution of the video.
Transcript: English (auto-generated)
So welcome to the morning session. It's my pleasure to announce the first speaker, Yoshua Bengio. He was born in France, lives in Canada, is professor for computer science at the Université de Montréal and scientific director at the Montreal Institute for Learning Algorithms. He received the 2018 ACM A.M. Turing Award together with Geoffrey Hinton and Yann LeCun for his groundbreaking work in the area of deep learning. Now, there's a tradition with the ACM Turing Award which asks every recipient to give a special lecture, called the Turing Lecture, at some time after receiving the award. Geoffrey Hinton and Yann LeCun did this earlier this year, and we are very proud that Yoshua Bengio decided to deliver his Turing Lecture at the HLF today. He had a minor accident yesterday, I hope it's a minor accident, which impedes his walking a little bit, so we particularly appreciate that he decided to deliver his lecture anyway. Yoshua, the floor is yours.
Good morning everyone. So I'm going to tell you about deep learning of course which is the thing I've been working on for most of my career. And the thing that got me into this field, the thing that really got me passionate about it, is something that I realized early on when I was looking
for a research topic when I started my master's, actually just before I started my master's. And it's the hypothesis that there may be a few simple principles which would explain our intelligence, animal intelligence, and would allow us to build intelligent machines. Now, this is a hypothesis, we don't know it to be true, but a lot of what we have achieved with deep learning in the last few years or decades reinforces that hypothesis. The alternate
hypothesis is that our intelligence is just a huge bag of tricks, of pieces of knowledge. Not so exciting. So when I was starting to study AI, the thing that was very popular, the thing that I was taught in my AI class, was rule-based
systems, symbolic systems. So I learned about all these things and search and logic and there was no learning. And people then in the early 80s believed that we might build intelligent machines by giving machines our knowledge, by
formalizing that knowledge in a form that machines could use and do inference with. Unfortunately there is a problem. A lot of the things we know, we know in an intuitive way, which really means that we know it but we can't describe
it in a clear and precise way. Sometimes we think we can explain it, we come up with a story, but it's only a superficial explanation. And if we try to program a computer with that knowledge, it just doesn't work. So it's just that we don't have complete knowledge of how our brain works essentially. There
are other issues with the classical approach, like a lack of grounding of these high-level logical ideas in perception, and insufficient handling of uncertainty, but I won't talk too much about these things. And on all of these fronts, machine learning has been making a lot of progress, and deep learning in particular. So the particular sort of machine learning that I started doing in the mid-80s, and have been doing since then, is called neural nets, and essentially it's inspired by the brain: the idea that the
kind of computation that can achieve intelligence is obtained by the synergy of a large number of simple computations like those of neurons. And there's one idea which maybe people don't realize is really important about neural nets, and even some neural nets are missing that. And it's really my
friend Jeff Hinton who pushed this idea and allowed me to capitalize on it and write a lot of papers that had influence, such as the work on representing words by vectors. It's the idea of distributed representations.
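To make the idea concrete, here is a toy sketch of distributed representations, with hand-set vectors and made-up attribute dimensions; in a real system these attributes would be learned from data, not written down.

```python
import numpy as np

# Toy distributed representations: each concept is a vector of attributes.
# The four dimensions here are hypothetical (say: animal, pet, size, barks);
# a real model learns them, this is only an illustration.
embeddings = {
    "cat": np.array([0.9, 0.8, 0.2, 0.1]),
    "dog": np.array([0.9, 0.9, 0.4, 0.9]),
    "car": np.array([0.0, 0.1, 0.7, 0.0]),
}

def cosine(u, v):
    # Cosine similarity: close to 1.0 means the concepts share attributes.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "cat" and "dog" share attributes, so they end up close to each other,
# and what is learned about one can transfer to the other.
cat_dog_sim = cosine(embeddings["cat"], embeddings["dog"])
cat_car_sim = cosine(embeddings["cat"], embeddings["car"])
```

With a symbolic representation, cat and dog would be two unrelated tokens; here their similarity falls out of shared attributes.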
It's the idea that instead of having a symbol to characterize a concept, we have a vector of attributes and these attributes are not handcrafted but they can be learned. And so two concepts like cat and dog can be close to each other and so what you learn from one can teach the machine something about
the other. Another big thing that I worked on during my PhD, and that has become really important in the last decade, is the idea that we can build really sophisticated machine learning systems by making them come out of three main ingredients. What is the thing we're trying to optimize? That's a reward function or objective function. How are we going to optimize it? That's a learning rule or an optimization algorithm. And a class of functions, a parametrization, with an initial set of values for those parameters. Basically, all of modern deep learning can be boiled down to these choices. We call that end-to-end learning. So I've worked on these end-to-end things early on in my career, but the thing that really made a difference came when we found a few tricks to train neural nets with more layers than was previously possible, and this is why we call this deep learning. Also because the term neural net was really not popular in those days, so we thought: let's invent something new. So
let me also mention a few things that happened after that. In the last decade we've been working on unsupervised learning methods; in fact, the way we got these deep nets to work in the first place was thanks to advances in unsupervised learning methods. More recently, we got these generative networks, which are unsupervised, to do pretty amazing things, and I'll show you some examples of that: synthesizing images that look very, very natural. And finally, more recently, I think we stumbled, in my group in particular, but now it's everywhere in deep learning, on the power of attention, something that really changes the nature of what a neural net is. People used to think of neural nets as vector-processing machines: you take a vector in and you produce a vector out, or you produce classifications or probabilities. But really, what attention buys us is something that brings us closer to the classical AI goals of dealing with variables and being able to process any kind of data structure, essentially, and not just vectors. So
I'll come back to that later. Another thing we worked on very early on, which is now very popular, is called meta-learning. It's the idea that we can have not a single optimization going on but embedded optimizations: think about evolution as sort of an outer loop of optimization and individual learning as an inner loop, but in fact even in an individual's life you might have these loops of optimization within each other. It's a very powerful idea, which allows one to train a system so that it will generalize better, or even generalize out of distribution, that is, to new things it hasn't been trained on, new types of things it hasn't been trained on. Okay, so now let me say a few words about compositionality, which is a central thing in computer science and actually also a central thing in neural nets. Compositionality becomes really important when you try to apply statistical methods to high-dimensional spaces. In many applications of statistics, what you're trying to do is learn a function over some space. But if that space is high-dimensional, essentially you have to estimate, for each configuration of the input variables, what the function should be; if you collect enough examples of each configuration, you can see by some counting arguments and simple statistics that you might be able to predict new values fairly well. Unfortunately, the number of these configurations grows exponentially; that's the curse of dimensionality. You will never have enough data to cover all of these configurations. This is illustrated here going from one to two to three dimensions, but of course we are dealing with thousands or millions of dimensions: a thousand-by-thousand picture is a million inputs. So if you have an exponential curse, you have to use other exponentials to fight it. That's the idea of compositionality, and in neural nets we're getting compositionality in two ways, and more recently we're working on a third way.
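The counting argument behind the curse of dimensionality can be sketched in a couple of lines, assuming we discretize each input dimension into a fixed number of bins (the bin count of 10 is an arbitrary choice for illustration):

```python
# Curse of dimensionality: if each input dimension is discretized into
# `bins` cells, the number of distinct input configurations you would
# need examples for grows exponentially with the number of dimensions.
def n_configurations(dims, bins=10):
    return bins ** dims

# One, two, three dimensions: the count multiplies by `bins` each time.
low_dim = [n_configurations(d) for d in (1, 2, 3)]
# A 1000x1000 image has a million input dimensions, so covering all
# configurations with data is hopeless; 10**1_000_000 cells.
```

This is exactly why methods that must see examples near every configuration cannot scale, and why an exponential advantage from compositionality is needed to fight back.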
The main ways we're getting compositionality in classical neural nets are these: first, through distributed representations, meaning that for any given input I can have an exponential number of configurations of the hidden units, these features that are being learned; and then there's another kind of compositionality, because these different features can be composed, but also, when I stack one layer on top of another, it's just the natural, usual mathematical composition of functions. And again, we had the hunch 20 years ago that we could get an exponential advantage, but it's only about five years ago that we were able to prove mathematically that you get this kind of exponential advantage from the compositional nature of the computation itself. Let me skip this. So one of the central points
about neural nets is that they allow us to learn representations, so we don't go directly from inputs to outputs as in, say, linear models. There are other machine learning methods, kernel methods in particular, that have an intermediate representation, but it's fixed, fixed by hand, whereas the focus with deep learning and neural nets in general is how we learn these intermediate representations. So, as I said, in 2014 we were able to show that both the number of units in a particular layer and the depth of the network can buy an exponential advantage, in the sense that the set of functions that can be represented with these neural nets grows exponentially, in a sense, with either the width of a layer or the number of layers. The way we say that the set grows is that we can build these neural nets to be piecewise linear by making each of the non-linearities piecewise linear, and if you have a piecewise-linear function, a convenient thing you can do mathematically is just count how many pieces there are; the number of pieces grows exponentially, and that's what we mean here. So you've got these families of functions that have the ability to represent piecewise-linear functions with an exponentially large number of pieces, but the number of parameters you need to express any function in that set is not exponential; it's linear in whatever n you're growing.
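This piece-counting view can be probed empirically. The sketch below (my illustration, not from the lecture) builds a small random ReLU network on 2-D inputs and counts distinct ReLU on/off patterns over sampled inputs; inputs sharing a pattern lie on the same linear piece, so the count of distinct patterns is a lower bound on the number of pieces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-hidden-layer ReLU net on 2-D inputs with random weights.
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def relu_pattern(x):
    # The on/off pattern of all 32 ReLUs; each distinct pattern
    # corresponds to one linear piece of the network's function.
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0.0) + b2
    return tuple(np.concatenate([h1 > 0, h2 > 0]).tolist())

samples = rng.uniform(-3.0, 3.0, size=(5000, 2))
pieces_lower_bound = len({relu_pattern(x) for x in samples})
# Already far more pieces than the 32 hidden units themselves:
# the pieces come from combinations of units, not units one by one.
```

The number of parameters here is small and fixed, yet the number of linear pieces found grows combinatorially with width and depth, which is the flavor of the exponential advantage being described.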
That's one very dry way of thinking about the advantage, but there's a more cognitive way of thinking about it. When we compose things on top of each other, what we are actually hoping is that the higher levels of representation correspond to more abstract concepts. So you can think of the first level as very simple, like linear features, and then we've got these non-linearities composed on top of each other, and the sort of dream that I put forward about 15 years ago is the idea that what we really want from these networks is that the top-level features somehow capture the underlying explanations for the data: the transformation performed by this encoder, this neural net that maps low-level things like pixels into high-level features, separates or disentangles the underlying factors of variation. What we then intuitively believe is that if we have such a transformation, we can answer almost any question that humans care about with very little data. When you put the data in the right space, all kinds of questions become easier to answer. But it's not any question; it's the kinds of questions that humans care about, because these high-level semantic factors are also the kinds of concepts we communicate with language. So that's what we'd like these networks to
discover, and we haven't achieved this, but we've made some progress. Another contribution that helped us understand, from a theoretical perspective, why training those nets actually worked has to do with one of the biggest hurdles to the acceptance of neural nets in the 90s and early 2000s, which was the idea that when you have these highly non-convex optimization problems, with lots of units and lots of parameters, you're very likely to get stuck in poor local minima. That's what people thought in the 90s, and in the early 90s people even proved that such bad local minima existed. Well, it turns out that a lot of the intuitions people have built about local minima come from what happens in 2D or 3D. If I draw a random function in 1D, you'll see it has many ups and downs, and of course some of the local minima will be bad, they will be high. But it turns out that in high dimensions, and the higher the dimension the stronger the effect, instead of having lots of bad local minima, what you have is a lot of saddle points, and the local minima tend to congregate and concentrate, in terms of probability, near the bottom, near good values. Another way to think about it is that in high dimensions a local minimum is kind of an extraordinary thing, because it needs a conjunction of unlikely events: that in no direction can you escape. When there are many dimensions, if you think of it as something random going on, you would have to be quite unlucky for every single dimension to offer no way to escape. So that's one way to think about what's going on. Another
thing that had a big impact, connected to these distributed representations, is the notion that we can represent high-level concepts like words with them. Each word is going to be associated with a vector, and we're going to learn those vectors. Each element in those vectors you can think of as representing an attribute, but they may not be ones that we define ahead of time. In 2000, we showed you could learn these representations by adding an extra, special layer as input to a neural net that predicts the next word given previous words, and that line of work has been extremely successful. It took a while, though: at the time, when this came out, it went pretty much unnoticed, but ten or fifteen years later, pretty much the whole field of natural language processing uses these word embeddings, these word vectors. If you peer into the attributes that are being learned, they actually make sense. These are two-dimensional visualizations of these representations; you can zoom in and see that words with similar syntactic, grammatical, or semantic attributes get to be close to each other, like different conjugations of the same verb, or different verbs that have similar meanings, and so on. As I mentioned earlier, one of the key advances we've made in the last two decades with neural nets has to do with
the unsupervised learning of these networks, and I think there's still a lot more to do. But let me say a few words about something that had a huge impact recently. It starts from the idea that, instead of trying to learn a density function, which is the classical way of representing a joint distribution or a density, if you have a lot of data and you want to characterize the form of that distribution, you can just learn the probability function, but that's hard to do for a number of reasons. Because we had so much success in training neural nets to be classifiers, we thought: could we reuse this really big power of learning to classify in order to learn about distributions? And the answer is essentially yes. Intuitively, it's pretty obvious that you can build a classifier that can separate the high-density regions from the low-density regions, and in particular we know that for the kinds of data we tend to care about in AI, like images and sounds and texts, the data distribution is concentrated near what people tend to call manifolds, which are basically just nearby sets of points that have high probability. So if you can just separate these high-density points from the rest, you could do a good job, and a few researchers had been exploring this, in particular Gutmann and Hyvärinen. What happened in 2014 is that, after trying many things to play with this idea, we came up with GANs, generative adversarial networks, which are interesting in many ways. Instead of having one objective function, which had been the standard way of doing things, we went to a game-theoretical setup where we have two networks that are going to be learned, and they are going to be in a
min-max game, where one is trying to defeat the other. One of the networks is going to represent the distribution through a generating function: it's just a neural net that takes random numbers, say Gaussian, and outputs, say, images, samples of the distribution you're trying to represent. So instead of representing the density through a probability function, we are representing it through a generator, which is convenient if what you want at the end is a machine that you can sample from. That's one network; it's sort of the end product of training. But to train it we need an objective function, and the problem is there is no objective function for a generator. We know maximum likelihood for density estimation, where you learn the density function, but what about when what I learn is a generative process? Well, what you can do is train a second network, a classifier, that is trained to distinguish between what the generator produces and the real data. The generator is going to be trained to maximize the error of the discriminator, and the discriminator is just trying to detect those attempts at fooling it. And the whole thing works.
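As a minimal sketch of that min-max setup, here is a 1-D toy with hypothetical fixed parameters, showing only the two objectives, not a full training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generator(z, a=2.0, b=0.0):
    # Maps Gaussian noise to samples; (a, b) stand in for learned weights.
    # With b=0.0 this generator has not yet learned to match the data.
    return a * z + b

def discriminator(x, w=1.0, c=-5.0):
    # Outputs the probability that x came from the real data.
    return sigmoid(w * x + c)

real = rng.normal(loc=5.0, scale=2.0, size=1000)   # "real" data
fake = generator(rng.normal(size=1000))            # generator samples
eps = 1e-9                                         # numerical safety

# Discriminator side of the game: classify real as real, fake as fake.
d_loss = -np.mean(np.log(discriminator(real) + eps)) \
         - np.mean(np.log(1.0 - discriminator(fake) + eps))
# Generator side: fool the discriminator into calling fake samples real.
g_loss = -np.mean(np.log(discriminator(fake) + eps))
```

In actual GAN training the two networks are deep nets and the two losses are minimized alternately by gradient descent, each player pushing against the other.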
In that 2014 picture here, these are all synthesized images. Initially it didn't work that great, but over the last five years the progress has been pretty amazing, and you can see how just the task of synthesizing face-like images has progressed, to the point that now you look at some of these images and it's hard to tell that they are synthetic. You can also make these GANs conditional, meaning that you can put in extra inputs, like a sentence. For example, here it says, "this bird is red with white and has a very short beak", and it would generate an image that corresponds to that input. This is
of course very useful in practice. Okay, so up to now I've been talking mostly about the things that we did that worked well, and now let me tell you about the limitations of the current systems, because as a researcher I'm not happy with the current state of things, and I'm looking forward. One of the ways of thinking about the different types of tasks that our mind, our brain, is doing is the division between system-one tasks and system-two tasks. System-one tasks are the kinds of things you can do in about half a second, intuitively, and in a way where you don't have access to the knowledge of how you do it; it's completely unconscious, and you can't express it, it's non-linguistic. Deep learning as we know it today is very good at that; perception is a typical system-one task, but other things you can do very quickly fall into that category as well. Whereas system-two tasks are the kinds of tasks that, when I was taking my AI classes in the mid-80s, we were trying to solve: things having to do with reasoning, things having to do with programming, things that you could express algorithmically. What I believe is that obviously we need both of these things, and one of the research directions that I and others are looking at is what's called grounded language learning, where we're trying to train a system that can do both. Language is really a system-two task, but in order to generalize properly, to understand what the sentences mean, what I and others believe is that you also need to learn a model of the world: you need to be able to predict how things would unfold, as with physics. Babies do that: they have an intuitive physics model that they learn in the early years of their life. So I think a lot of the upcoming work in deep learning is trying to tackle these system-two tasks while relying on
the things that we were able to do at the level of system one. In order to advance on that path, let me tell you about attention mechanisms, because I think they are a key to that progress. We introduced attention mechanisms, a particular kind called content-based soft attention, in 2014, and we introduced it to solve problems in machine translation, because when you translate and you're producing the next word, you want to be able to focus on just one or a couple of words in the source sentence, and this ability to focus on a few elements is what attention is about. What we found is that introducing this mechanism really made a huge difference, and in a matter of a year or two, Google had included that trick in their Google Translate system, which made a huge jump in performance, so that since 2016 we have systems that are way better than the classical methods, which were based on n-grams. But it turns out that this attention mechanism is doing much more than a trick for machine translation: it allows us to work on any kind of data structure, because now, if you think about a tree or a graph, you can focus your attention on one node or another node, sequentially visiting the parts that you need in order to produce an answer. It's also connected to memory: you can think of memory access as selecting one from a few sets of possible memory locations that you want to read from. This is very different from the usual ways of thinking about memory, where we give an address and we retrieve something. Here, it's as if we're looking at all of the entries in the memory at once, in parallel, and the things that somehow echo our request are going to be associatively retrieved. But you can do it in a soft way: you're actually accessing all of the entries at the same time, each with a different weight.
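Content-based soft attention as just described can be sketched in a few lines; the memory and query here are hand-built toys, and in practice the scores come from learned projections of queries and keys.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Four memory entries; the query most resembles entry 1.
memory = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
])
query = np.array([0.0, 0.9, 0.1])

# Compare the query against EVERY entry at once, in parallel; the 5.0
# is an arbitrary sharpening temperature for this illustration.
weights = softmax(memory @ query * 5.0)
# Soft, differentiable read: a weighted average over all entries.
read = weights @ memory
focused_slot = int(weights.argmax())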
weights and these are the attention weights that you can learn so these these attention mechanisms they learn where to put attention right and in the case of memory it's where to read we also have ongoing work suggesting that the this these attention mechanisms and and the use of a memory can
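The soft read just described can be sketched in a few lines of plain Python. This is a generic illustration of content-based soft attention, not the exact 2014 formulation; the dot-product score and the toy memory contents are my own choices:

```python
import math

def soft_attention_read(query, keys, values):
    """Content-based soft attention: score every memory entry against the
    query, normalize the scores with a softmax, and return the weighted
    sum of the values (a soft associative read touching every entry)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                              # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # the attention weights
    read = [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]
    return read, weights

# Toy memory with 4 entries. The query "echoes" entry 2, so that entry
# dominates the read, but every entry still contributes a little.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
query = [0.0, 0.0, 5.0]

read, weights = soft_attention_read(query, keys, values)
# weights[2] is close to 1, and read is close to values[2] = [2.0, 3.0]
```

Because the softmax weights are differentiable functions of the query and keys, a network can learn where to attend by gradient descent; that is what makes this soft addressing trainable end to end.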
We also have ongoing work suggesting that these attention mechanisms, together with a memory, can actually defeat one of the big issues with recurrent networks, which process sequences. One of my most cited papers, from the early 90s, showed that there is a fundamental challenge in training recurrent networks, the networks that operate through time and have an internal state. It turns out that using attention and a memory, you can go back to old states and skip through time in a way that eliminates this problem. This is still ongoing work, but we had a paper on it at the last NeurIPS. These attention mechanisms are also used today throughout the applications of deep learning, in language processing tasks, with what people call self-attention and transformers.

One of the other threads we're working on these days, connected to attention, as I mentioned briefly, is how we could use deep learning to build something that captures at least one of the aspects of consciousness, where you again use attention to focus on the few aspects of the state of the world that you need in order to take decisions, to plan, to imagine, and so on. So we've made huge progress, but
we're still very far from human-level AI. It is important to realize that, because sometimes you hear a little bit of exaggeration. The industrially deployed systems are, for the most part, based on supervised learning, where humans have to provide the high-level concepts through labels. And if you look at the mistakes that these systems make, you can actually create mistakes for them, like what you see on the right here: the two images are actually slightly different, and the one on the right was tweaked just a little bit, by optimizing the input, to make the neural net classifier think that it's an ostrich rather than a dog. Of course, humans don't make these kinds of mistakes, and there are other, even more revealing kinds of mistakes we have studied which suggest that the dream of discovering the high-level underlying explanations for the data with deep learning isn't yet achieved.

Although there's substantial progress in that direction, we still have a lot more to do, and I think part of the answer is looking at how humans manage to do what they're doing, in particular children and infants. Even a two-year-old, for example, understands physics: not with equations, of course; it's called intuitive physics. And babies learn this not through supervised learning, because their parents don't teach them Newton's equations; they learn it just by interacting with their environment and observing how it works. This is the kind of learning ability that I think we're missing right now in machine learning and that we need to develop. It's related to what I call the agent perspective on machine learning: the idea that we need to build models of the world, machines that learn how the world ticks, including the causal structure, something that deep learning hasn't looked at very much up to now. It's a very recent thing: the first workshop on causality was at the last NeurIPS, but I see this as a growing trend, and my group is working on this as well. So let me
go back to this question of representation, because it's at the heart of what we achieved, and I think it's still at the heart of what I see moving forward. How can we coax these neural nets to discover the right kind of high-level concepts that humans manipulate and reason with? More generally, what is a good representation? We've been thinking about this for many years, and what I believe is that a single objective, like reconstructing the inputs, which is a common thing people do these days, is not going to be sufficient. We need to provide clues to these learning machines so that they can find these good representations, and it's the different pressures from different clues, which you can think of as very general forms of prior knowledge about space, about time, about causality, about agency, and about what I call sparsity of the dependency structure, that I think can help learning systems discover these high-level representations.

One of these clues is the sparsity of dependencies at the high level. In other words, if you look at the world at the level of pixels, the dependencies are very strong: every pixel depends on every other pixel in a very complicated way, and in order to predict one pixel you would need to look at essentially all of the other pixels in the image, and even then it won't be such a great prediction. But if I move from pixel space to the ideal space I'm thinking about, the kind of space where we describe the image with words, then the dependencies become very different in nature. I can predict one thing, like "I'm going to catch this object", from just a couple of other things. So there are very strong dependencies, meaning the probability that a prediction at that level is realized can be very high, and at the same time, which sounds contradictory, I can make it using very few other variables. This I call the consciousness prior, because I believe that what we do with our consciousness is focus on just a few of these aspects of the world that have strong dependencies together, and we form thoughts that aggregate these variables and allow us to predict things that are going to happen, or imagine things that could happen, and so on. So in this line of work, the way we think about high-level representations now has two levels: there is the unconscious state, which is all of the things you could think about, and the conscious state, which is a selection, made by an attention mechanism, of just the few aspects you are currently thinking about.
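As a toy sketch of that two-level picture (entirely hypothetical, not an implementation from the talk), you can think of the conscious state as a hard selection of the few top-scoring elements of a much larger unconscious state, where in a real system the scores would come from a learned attention mechanism:

```python
def conscious_state(unconscious, scores, k=3):
    """Keep only the k elements with the highest attention scores;
    everything else stays unattended at this step."""
    ranked = sorted(unconscious, key=lambda i: scores[i], reverse=True)
    return {i: unconscious[i] for i in sorted(ranked[:k])}

# A large unconscious state: many candidate high-level variables ...
unconscious = {i: f"var_{i}" for i in range(100)}
# ... but the attention scores single out just a few of them.
scores = [0.0] * 100
scores[7], scores[42], scores[63] = 2.5, 3.1, 1.9

thought = conscious_state(unconscious, scores, k=3)
# thought == {7: 'var_7', 42: 'var_42', 63: 'var_63'}
```

In an actual model the selected variables would then feed a predictor over the next conscious state, exploiting the sparse high-level dependencies described above.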
Another clue, more connected to the causality issue and the agency issue, has to do with how a learning agent interacts with its environment. Now we're not only thinking about images; we're thinking about policies, about how an agent acts in the world and can have intentions. In particular, if I go back to this thing of dropping the pointer: when I did it, I created a mental representation of the sequence of actions that led to dropping the pointer. So my mind can build internal representations of sequences of actions, or intentions, to achieve things in the world, and what we're saying is that there is a relationship between representations of these intentions and representations of the state of the world. It would be very convenient if the representations of my intentions were, in some ways, very close to the representations of the world: if I can modify the position of that pointer, then the position of that pointer should be one of these high-level quantities. The things we can control through our agency should be things we can name. So that's an extra clue we can exploit, and of course it has to do with causality, in the sense that when we start intervening in the world we're changing it, and this is a very powerful thing; this is how we can use that information to plan, and even take into account what other agents in the environment will potentially do.

So what we're working on these days are things like the idea that these high-level variables not only have sparse dependencies, and are connected to the kinds of intentions we can have and the kinds of things we can control in the world, what psychologists call affordances, but in addition have the property that many of them can act as cause or effect. If I try to find a causal relationship between pixels, I won't be able to find much: there is no intervention on one pixel that will make other pixels do things. But I can intervene on the position of this pointer. So there is another clue about these high-level variables, and we can use it to learn what the causal variables are, something that hasn't been done much up to now in machine learning.

Let me say that this is connected to some of the basic
questions in the theory of machine learning as well. Learning theory has been based on the assumption that the data comes from one distribution: you have i.i.d. data, so all the training examples come from the same distribution, and the test data comes from that same distribution too. But when you deploy machine learning for real, it's not like that. We train on data from the lab, or from some particular country, say, and then we'd like to apply those systems in a different country, so there's often a change in distribution, also because the world changes. Out-of-distribution generalization is one of the most important concerns in machine learning these days, and we don't have good theories for it. What I believe is that one of the key ingredients to face this is to make very mild assumptions about how the world changes when you go from one distribution to another, and the assumption we are exploring is connected to the notion of agency. An agent acts in the world: I push things around, I open the door, I sprained my foot. That last one is a good example: the distribution I'm seeing is now very different since yesterday evening, and I'm learning so many things that I didn't know I could do. So these changes come from agents, and agents tend to have an effect on the world that is localized, because we can only act at a particular time and in a particular place. We can exploit that assumption and build models of the world with this extra property: when things change in the world, the causes of those changes can be identified with just a few variables or a few parameters in our model of the world. This is what we've been doing over the past year, and I'm not going to go into the details.
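Here is a minimal, purely illustrative sketch of that assumption (the mechanisms and numbers are made up): if the world is described by the right high-level variables, a distribution shift caused by a single intervention shows up as a large change in just one of them, so the change can be localized:

```python
# Toy "world" with three independent mechanisms, here just mean values.
# An intervention changes one of them; we localize the change by comparing
# per-variable statistics before and after the shift.
import random

random.seed(0)

def sample(means, n=2000):
    """Draw n observations, one Gaussian per mechanism."""
    return [[random.gauss(m, 1.0) for m in means] for _ in range(n)]

before = sample([0.0, 1.0, -2.0])   # training distribution
after = sample([0.0, 4.0, -2.0])    # someone intervened on variable 1

def column_mean(rows, j):
    return sum(row[j] for row in rows) / len(rows)

shifts = [abs(column_mean(after, j) - column_mean(before, j))
          for j in range(3)]
changed = max(range(3), key=lambda j: shifts[j])
# The whole shift is explained by a single variable: changed == 1.
```

At the pixel level the same intervention would change almost every pixel at once, which is exactly why the localization assumption only pays off in the right high-level representation.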
But I want to close with this slide, which is not directly connected to the science we've been doing, but to some of the concerns and some of the initiatives we're taking. I believe that as AI moves out of the lab, the scientists and engineers who are developing it have a responsibility, because it's deployed now everywhere in society, and that's just the tip of the iceberg. Of course, when you build powerful tools, these tools can be used for good or they can be misused, and it becomes really important to clarify which ways of using these technologies are acceptable and which are not, and sometimes we have to change laws and regulations to do that properly. The other thing is that we can encourage the good uses. One of the things we've been doing in my group is to encourage our students and other researchers to connect with people in NGOs and other organizations that have problems, to try to favor the deployment of AI systems in healthcare, the environment, fighting climate change, education, and things like this. In particular, we created a new organization called AI Commons, which has the goal of coordinating different actors who want to do AI for social good but may come from different horizons: some of them are domain experts, like NGOs, and others are grad students with the skills for this. Thank you very much.

Thank you very much for a fascinating talk. We have time for a couple of questions from the audience,
and I would like to ask the helpers with the microphones to... yes, Joseph, I guess I'll just give you my mic.

Thank you for the inspiring talk. I have a question regarding the possibility of bridging System 1 and System 2. The way I understand this is that in System 1 you have implicit knowledge, and you should somehow make it explicit. No, that's not going to work; you cannot make it explicit, because it would require billions of computations. I mean, you could make it explicit as a program, but it would be a completely incomprehensible program. In System 2 you deal with explicit knowledge, that's right. OK, so how will you connect the two systems of thinking? That's my question. Right, exactly. This is of course speculation, because it hasn't been done yet. People have tried, and I think the wrong way of doing it is to think we can just stick the classical AI ideas on top of neural nets. I don't think that's going to work, and it doesn't even correspond to the form of reasoning that humans actually do: we don't look at millions of search trajectories in our mind, at least not consciously. So instead, what we're trying to do is learn what the System 2 computation should be doing, using things like attention mechanisms, so that it can select the right variables at the high level and combine them in order to make predictions about the future, about other variables you might care about, and also link them to things that may have happened in the past, through memory. Then, instead of this very unnatural search, what you would have is a very smart control mechanism that knows what to think about, and that part is actually System 1: you don't have conscious access to what you decide to think about, it just happens. So in other words, we're going to embed the intuitive capabilities into the form of computation that is needed in System 2.

I sympathize with your
focus on consciousness as a way of identifying certain features that are worth pursuing, like noticing certain clues. But what that means is that different systems will have different consciousnesses. That's right. Or personalities. This corresponds to finding two perfectly decent Americans, one of whom is sensitive to Trump's lies, has a memory long enough to be able to recognize them, and therefore despises him; the other, on the other hand, does not notice lies, does not remember what happened two weeks ago, but does notice people with dark skins, especially if they speak English poorly. Yes, you're going to get very different kinds of consciousness, especially if you expect these things to happen automatically. And now what do you do? Well, actually, it's already the case: the way we train these systems right now, we initialize their parameters randomly, and then, depending on how you initialize them and on the exact order in which they see examples in their lifetime, they will develop different representations, and so they will have a different subjective experience of the world. I don't think it's such a big thing; we can run lots of experiments to see how they differ in their predictions, and we can have committees and ensembles where we basically make them vote on something. So I don't see this as a huge problem. Of course, it's very different from the kind of crisp, deterministic, repeatable computation that we're used to with System 2.

Thank you very much again, Yoshua, for your great presentation, and a speedy recovery for your injury. Thanks. Thank you very much.