Add to Watchlist
6th HLF – Laureate Lectures: Equilibria, Fixed Points, and Computational Complexity: from von Neumann to Generative Adversarial Networks
0 views
Citation of segmentFormal Metadata
Title  6th HLF – Laureate Lectures: Equilibria, Fixed Points, and Computational Complexity: from von Neumann to Generative Adversarial Networks 
Title of Series  6th Heidelberg Laureate Forum (HLF), 2018 
Author 
Daskalakis, Constantinos

License 
No Open Access License: German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties. 
DOI  10.5446/40171 
Publisher  Heidelberg Laureate Forum Foundation 
Release Date  2018 
Language  English 
Content Metadata
Subject Area  Computer Science, Mathematics 
Abstract  Constantinos Daskalakis: "Equilibria, Fixed Points, and Computational Complexity: from von Neumann to Generative Adversarial Networks" The concept of equilibrium, in its various forms, has played a central role in the development of Game Theory and Economics. The mathematical properties and computational complexity of equilibria are also intimately related to mathematical programming, online learning, and fixed point theory. More recently, equilibrium computation has been proposed as a means to learn generative models of highdimensional distributions. In this talk, we review fundamental results on minimax equilibrium and its relationship to mathematical programming and online learning. We then turn to Nash equilibrium, reviewing some of our work on its computational intractability. We conclude with modern applications of equilibrium computation, presenting recent progress and open problems in the training of Generative Adversarial Networks. The opinions expressed in this video do not necessarily reflect the views of the Heidelberg Laureate Forum Foundation or any other person or associated institution involved in the making and distribution of the video. 
Related Material
Series
Annotations
Transcript
00:01
[Music]
00:22
so we are moving to the next speaker of this session whose Costas Tosca Lucas he's from Greece but he's a professor at MIT and he's one of the very fresh winners this year so he received a novel in a price on August 1st this year in Rio and we are very happy that he could fit it into his schedule to come here at the first opportunity for him to be at HLF he's working in computation of the rehab with the eye interfaces to economics game theory probability and we are looking forward to his talk so causes the floor is yours [Music]
01:11
hope you enjoy your time enjoying so I wanted to talk about the interaction of game theory computational complexity topology and also present some very modern applications of game theory to machine learning there are things you know you like so without further ado let me talk about game theory so let's see
01:36
look at this game so this game I found online a few years ago it's a check game so it's played on this six colored board [Music] and it's played by two players who alternate moves so there's a talking but you place in the beginning of the game at the start position and the girl of one of the players is to bring it to the bottom right position to get the finish known well the other guy wants to avoid getting the talking there so now how's the game played well at the upper right corner of this picture you see that there was something that was cut off it was actually a hexagon whose six edges are colored with the six colors and what you do is you cut it you cut it out you put you put a toothpick through it and you use it you notate it to select a random color okay so once your turn to move you rotate that hexagon you choose a random color and then depending on where the token is currently say the token is here if you get a yellow you have to move over here but if you get a black you can choose to move either here or there and so on and so forth okay and then it's the other players turn to move so now again one player wants to get the token from the upper left to the bottom right the other player wants to avoid getting a talking there okay so that's a twoplayer game and it's a constant sound game okay so that that you know the total probability of getting the talking here is one the total probability you have is one and that probability is gonna be divided between the two players so now the question is who wins in this game so as it turns out a little argument and I invite you to do it later in your head is that actually the invader loses with probability one there is a strategy for the you know the hunter the guy who wants to get the talking out to the bottom right position there is a strategy that guarantees that the talking will end up there probability 1 ok so this particular game is very easy to solve now not all games are so easy to solve and that was the motivation of for Norman who in the 20s started thinking about game theory and he proved this very beautiful theorem called the minimax theorem the minimax theorem states that if you have a function f that's continues and it's convex in one of the arguments and concave in the other argument and the sets where these arguments lie are convex and compact then you have the two optimization problems are actually the same here they get the same value so the min of our X marks over Y of this function is the same as the max min of that function that's a remarkable start okay that that usually does not happen one of the two directed one one of the inequalities is obvious okay the max is certainly smaller than the mean but the fact that it's equal it's actually very surprising now what does this mean for game theory so I would like to interpret the situation as follows I want to imagine I have a twoplayer game between a player that I'm going to call min and who's controlling X and another player whose I'm gonna call Max who's controlling Y now for every pair of choices FX Y is the payment from min to max alright so now with this interpretation what the theorem says is the following that looking at the right hand side okay if the right hand side is smaller than some B star then this means that for all Y there is an X such that this function value is smaller than B star now the remarkable fact that it falls by the Equality of the righthand side of the left hand side is that if that happens there is actually a universal strategy for the mint layer so that the payment is bounded by this time right so if for every strategy of the max player there is a way for the min to pay at most of the star then there is actually a universal strategy for the min player so that no matter what the other guy does he pays at mostly star so that's pretty remarkable okay so in particular also if the X star is the optimum of this left hand side optimization problem y star the optimum of the right hand side and V star is the optimal value of both sides then the pair of strategies XR and Y star is an equilibrium of this game namely if min plays X star and march plays Y star a lot of the players can improve their payoffs so the min player cannot pay less the max player cannot gain more moreover the equilibrium set is convex and the payoffs under any equilibrium are the same movie star in particular so that's pretty cool and just to be concrete here's the game that you play you know like you played as kids the rockpaperscissors game at least this was the game that was played before electronic games okay okay so that guess has two players three strategies for players very symmetric so you can think of X the the the strategy set of the min player okay as the simplex of rock of rock paper and scissors and similarly the strategy set for the max player is the same and the payoff function of this game is just this quadratic form where a is this matrix this matrix is characterizing the payments of the player who's choosing the rows of this matrix to the player who's choosing the columns of this matrix for instance if the row player chooses scissors and the column player uses rock with and rock breaks scissors so the this guy has to pay in mind let's see I see so this is the I guess is the favored forgets my signs are unfortunately reversed but okay so it's the payment matrix so in an event the liberal of this game unifrom strategy for both of the players and under unifrom strategy which are hoping familiar with from at least the schoolyard no player wins anything in expectation or games anything in expectation cool but there's more to the story so when the minimax theorem was discovered then you know you know like soon after you know people started thinking about mathematical programming so down chicken for norman were talking to each other and they realized that the minimax theorem follows from strong linear programming duality in fact very recently it was shown that the other direction is true so the two probes are actually equivalent computationally solving games is the same as solving linear programs and vice versa all right and as we know this can be done in polynomial time and it's more beautiful than that starting again with the 50s that have been that have been you know people have started if there are dynamics distribute dynamics for the players to arrive at equilibrium and we have discovered you know like several dynamics the whole family of dynamics if run by the mean in the max player of the game is going to get them to who liberal so not it's not just that equilibria have this nice properties being a convex set and being a you know solvable in polynomial time yeah but also distributed dynamics get you there of course when you go to the real world
09:32
you know a life is been harder okay and that has to do with the fact that games that are interesting usually have many legal positions for instance chess has ten to the forty three positions go has ten to the hundred seventy legal positions approximately and as you know this is very large okay and it's even worse than that in poker because you also have incompleteness of information you don't know what the cards were dealt to your opponent but in any event so
10:08
still the fact that minimax equilibria have this nice structure allows us to actually exploit the structure and beat human players in this very complex games all right as you know in this in the 90s computers were able to beat Kasparov in chess and more recently remarkably the computers were able to beat the champion in God and even more recently professional poker players okay to play Texas Hold'em and this is not happened by accident this progress has been enabled by our understanding of the structure and the algorithms for computing equilibria in these games roughly speaking to you assume that your minimax equilibrium has some nice property some lower complexity form and then you try using the starting expert games or by self play try to tune the parameters of that low complexity strategy and as it turns out if you do that enough you actually can beat humans or you can compute basically better strategies done Shobana T has computed over the last centuries I mean go goes back to ancient times okay so many centuries cool so that's some minimax so some games are you know have a very interesting structure and which allows us to compute good strategies yet when
11:53
you go when you move away from this nice class life is harder okay we are really bad in making predictions about more complex games right in fact if we started if you studied data from a
12:13
complex environment you soon realize that the behavior of players is not consistent with a Nash equilibrium so in this plot here I'm showing you beat bidding behavior in adoptions from from this paper by negative C garnets and Tardos where you see that players no disease are not consistent with independent you know something from a randomized strategy as you know naturally we would predict so
12:46
what's happening in the you know in the more complex environments well just to lay the foundations John Nash in the 50s extended for nine months minimax theorem showing that every finite game also has an equilibrium in randomized strategies so what do this term mean okay so a finite game is one where the number of players is finite every player has a finite strata strategies and a pair function that map's everybody's actions to as like a utility and you allow players to randomize over their selves it said when an equilibrium is a collection of randomizations over its players strategy set so that for every player his randomized strategy is the best response to the product measure of the other player strategies okay so now show that every finite game has such an equilibrium in randomized strategies on the other hand a few things that phone lineman show did not actually go through so the equilibrium set is not convex now equilibria are could be many but they could be disconnected and the payoffs of the players may change across different equilibria of the game so the structure is not at all the same in fact there is a parallel between sort of like the development of von Neumann and you know the the developments in linear programming and learning that followed and the developer not Nash and there our inability to actually extend these other properties so both on Diamond and Nash use not constructive arguments to prove their theorems on the other hand while for Norman's theorem can be reproving using linear program in duality okay then giving rise to algorithms and distributed dynamics on the in more general against we have been unable to actually get a you know efficiently effectively constructive proof of the of the theorem in particularly after 60 plus years of research they have we don't know of a polynomial time algorithm for this problem and of course if you don't have a centralized algorithm you also won't get efficiently distributed dynamic because efficient distribute dynamics are even harder than having a centralized algorithm you also have them communication constraints and so on so forth also as I mentioned earlier are practical ability to solve multiplayer games is also much less impressive and which you know for somebody like me points to you know the question is there a computational complexity barrier that is preventing us to tackle multiplayer games so it's an honor to be in this
15:54
forum with some of the you know legends of you know computer science and in particular computational complexity but you know as you know the way we study the complexity of problems is by looking at the P versus NP paradigm that's the common way to do that so the researchers over the past several decades have classified problems into complexity classes and the most prominent ones are P for polynomial time and then P for nondeterministic polynomial time and on the picture I'm showing you some of the famous problems that belong in these classes right so importantly matching primary testing and linear programming can be solved in polynomial time while related problems like three part ID managing factoring and number into its prime factors and integer programming are actually you know we don't have polynomial time algorithms for them all we know is that they belong to the class NP and in fact some of these problems are as hard as any other problem in NP so what you know Steve cook showed in 71 is that the boolean satisfiability problem which is I give you a boolean formula and I ask you to find a satisfying assignment is npcomplete so as hard as any other problem in NP as any at least as hard as any problem in NP and Karp you know in next year provided the list of 21 and the complete problems including the Traveling Salesman into the programming tripartite matching and so on so forth so now the question is okay so the automatic response from the inability to extend the nice structure we know for zerosum games to general games is is it that be complete to find a Nash equilibrium okay that's the automatic response for somebody in my area and here the thing gets interesting because of the existence theorem provided by now so the existence theorem by Nash is making it a little bit strange to study the complexity of Nash equilibrium under the prism of the P versus NP problem in particular if you try to prove the Nash is npcomplete you quickly get into you know complexity class collapses like that NP and a related class called current be collapse okay and which we don't believe is actually the case so it's probably not it's probably Nash is not npcomplete so it has a similar you know behavior like factoring which is like also an unclassified problem that sort of like it's hanging out between P and NP and we don't know how to classify so start starting this problem what so you know trying to get into you know get to understand the complexity of Nash equilibrium you know but you know 10 or
19:03
more years ago with Goldberg at the poly material what we showed is that the complexity of finding Nash equilibria is the same computationally as the complexity finding browser six points okay so you know I hope you've seen browser fixed point theorem it says that if I give you a continuous function from some convex and compact set to itself there is a point that it stays fixed under the map okay so a natural computational problem is you know if I give you such a function to compute a fixed point so what we showed is that the Nash equilibrium so as I said so so now CH proves his theorem using browser switch points so what we do is we can do the opposite we can you know given an instance of the browser problem we can produce a games whose Nash equilibria are related to the fixed points of the function so that is a computationally efficient reduction between the two problems that establishes these two problems are actually computationally equivalent now going back to the picture that I showed
20:07
earlier it was known before our work that Brower is not just a problem that is hanging out between P and NP but it's actually complete for a complexity class that's called PPD so in particular what I feel and says is that unlike factoring whose computational complexity is not clear yet Nash equilibrium by being equivalent to brow fixed point theorem is actually complete for this complexity class PPD unfortunately I'm not going to define it but I would I will give you some pointers in the end of the lecture if you wanna you know see what it is but importantly you know if you have that your problem is complete for some class you can use complexity theory machinery to prove it equivalent a bunch of other problems and as it turns out falling from work that happened before us and after us Nash equilibrium is you know not just complete for PPD and not just equivalent with flour but it's equivalent with a host of other problems in combinatoric topology and economics so for instance I'm missing some of them here so sperner's lemma skaars lemma fractional hypergraph matching market equilibrium are all equivalent problems in Nash's theorem so that's the beauty of complexity theory once you prove your problem is complete for a class you can now you know you can try it showing it equivalent to other problems okay it's a picture we don't have for factoring okay but also showing that the problem is complete for a class identifies precisely the computational barrier into developing polynomial time algorithms for your problem and our PPD completeness is now basically identifies a combinatorial barrier for that you know you have to to find to compute equilibria and polynomial time and so but you know I claimed that the fact that Nass is equivalent to breyer and also complete for this class explains why we have been unable to get polynomial time algorithm so the problem and also why we have not been able to remove the use of brown's fixed point theorem in establishing Nash equilibrium so what this theorem says is that Brower is inherent in showing that Nash equilibria exist while it's not inherent for showing that minimax equilibria exist and that's why it was you know this the other theory and minimax was shown also with linear programming theology okay so you know after we prove the are
22:46
result people last you know like okay is this over for game theory and I think it's absolutely not it just makes the story much more interesting okay what follows from our result is that in a computational respecting world the universality of the Nash equilibrium breaks down okay so part of the reason why Nash equilibrium so prominent in economics is that it always exists at Universal so any game can be analyzed using Nash equilibrium so what we're saying is that complex games may have a Nash equilibria but that's not computable in polynomial time so which you know breaks down the the universality of the concept and what we have been advocating and you know over the last years it's becoming more and more prominent is that if you are about to use Nash equilibria to make a prediction about again you have to be you have to understand what makes Nash equilibria and your game actually computationally tractable okay so if your game is two player zerosum I have no problem with you using Nash equilibria but if you you know you don't know why a Nash equilibrium will arise then you shouldn't use NAS human predictions not equilibrium properties to actually argue about you know what will happen in that interaction now if your game fails to have good structure then you have to make it have good structure sometimes you can design your game so you can infuse your game with good structure that will allow equilibria to arise if you're unable to do that then you should switch the different solution concept okay maybe you should side dynamics of games or some other type of equilibrium but it's it's risky to actually design your you know analyze your game or design your game with a thinking that Nash equilibria will magically happen so this sort of like a brief overview of the complexity of Nash equilibria and now I want to turn the page and come to
24:49
something very very very recent okay as of you know four years ago and that's to do with what is called what are called generative adversarial networks where we have a very nice interaction between zerosum games and machine learning and the motivation for this development is as you know the archetypical question in machine learning which is the end probability theory and statistics which is developing models for realworld distributions okay so that you know there are k typical question in machine learning is i give you samples from some distribution of interest and i want you to come up with a model that you know i can use to generate samples from that same distribution so in other words i want to learn a generator of the generate similar samples like the real examples now when it comes to you know like if you want to estimate a multidimensional Gaussian that's an easy task usually because you can collect enough data to estimate the mean and the variance of the distribution however many distributions of interests are nonparametric okay so if I wanted to learn distribution over images that's not an easy task because you don't have a you cannot postulate a parametric description your distribution to then estimated parameters okay so a big open problem in machine learning is how do you get good models of interesting high dimensional distributions and one approach that has been suggested over the recent years those are scholars first of all instead of maintaining an explicit density for your distribution give that up and instead try to only find a generator that is a function try to find a function that will eat boring randomness namely isotropic Gaussian randomness and will spit out interesting around ominous namely samples from the distribution that you're interested in okay so in my picture here suppose I want to learn a distribution of the faces of people okay so what I want to do is I want to come up with the function f that we're you know that will eat a Gaussian sample and spit out a photo of a person okay now what you know like like you know like the recent trends in machine learning what has been proposed to use as a function to do this thing is to plug in a deep neural network okay not just anything I mean of course you have to pick a function from original family to implement interesting stuff so you know they're you know the recent for inner functions that people have been playing with our deep neural networks they have enough explicit expressive power that potentially can do this job so you design an architecture for a deep neural network you put it in there and you know you have to tune the parameters of that function  you know achieve your goal which is to generate samples like the ones that you see in the Indian data set I'm not going to describe how neural networks are that's not super important for what I'm worth describe but for you if you haven't seen what it is before think of a neural network as a parameterised function the texas input some low dimensional input okay say a hundred dimensional outputs a high dimensional art could say a thousand or a million dimensional it also has a lot of parameters like a million parameters what's important is that it's differentiable almost everywhere but respect to both theta the parameters and its input Z now you you know design you know you you you you design an architecture for a neural network and it has this free parameters Tara your goal is to look at samples from the target distribution the distribution of a real faces and tune these parameters so that the output distribution of your neural net matches the true distribution so and
29:17
if you do that you get actually pretty impressive results okay so in this picture I'm showing you on the right hand side let's focus on the right hand side what happens if you train models on distribution on a database of faces of real humans after you do your training you're actually you know after training us you know it's over now you can plug in Gaussian inputs and get random faces these random faces are not real people they're hallucinating people buy this neural net but these need a little learn the distribution from which human faces come are coming from so it cannot generate new faces so these guys you see over there are not real humans these were these are completely hallucinating people now if if I manage I'm going to show you the best that there is okay in
30:09
terms of this technology of course I cannot somehow play this video because my screen is different than your screen let's see oops I don't know what to do
30:31
oops
30:35
all right and object duplicates okay
30:47
okay this is the latest and greatest okay so so I'm going to show you what happens if you train such a thing on what is called the celibate data set the cellular data set is a data base with photos of celebrities okay so I'm showing you now I'm going to show you first the data the data set that I'm going to train on okay
31:08
these are real celebrities okay and
31:13
Celebi is a database with a lot of
31:15
headshots of real celebrities okay so
31:19
here are some samples from the database
31:22
okay and then what I'm gonna do okay so
31:29
this is not me I think it's a paper by Nvidia yeah that's what I'm gonna do is
31:34
I'm gonna use this database to Train
31:36
generate another sale Network and now I'm gonna show you what happens if i plug in random inputs too and look at
31:42
the output of that thing so now these
31:45
guys are not real humans this were these
31:48
are these are you know some like the little never dreamed of this before okay
31:54
you can see that it did a bad job at
31:56
some places like some of the boundaries are a bit fuzzy in stuff like like you
32:01
know that guy like in the middle right
32:02
right but you know it's doing a pretty
32:05
good job that's pretty remarkable okay these are some of the samples that it
32:11
generates it's pretty cool huh and in
32:17
fact but I'm not gonna get into that it
32:20
learns a lot about the geometry in which this faces actually lie okay so these
32:27
faces have nothing Euclidean geometry if I take a face in another face and average them in terms of averaging the pixel values I'm going to get some garbage in the middle so you shouldn't average in the Euclidean sense but I can use this neural nets to actually understand the geodesics of the space of images I can start with an image and another image and I can interpolate in the geodesic on you know the geometry that contains the symmetries so that's really actually cool okay in an event so more recently and before I get into the mouth actually this week there was the first auction run by Christie's were again created painting was sold to somebody against a French collector but this image that was actually generated by neural network okay and let me show you the signature of the painter the signature of the
33:23
painter you know it's a bit fuzzy it says min max of a function okay so it's min marks of something okay let me let me tell you why it says min marks of something so the way you actually set up
33:38
the training procedure is you actually define a two player zerosum game between two neural networks one you know network is the unilateral showing you earlier the one who gets garbage noise okay and and processes it and output pictures the other neural network is the judge he looks at inputs that were generated that were hallucinating by one of the two units I also takes as inputs in real real images and he has to distinguish you know is you know am I looking at the hallucinating image or am I looking at the elements okay so the judge is trying to distinguish between the outputs of the generator and the real distribution the generator is trying to fool the judge so the idea is that at the equilibrium of this two player zerosum game the generator should be doing a good job so in particular you set up a minimax problem where you know one of the two players is controlling the parameters this deep neural net the other player is controlling the parameters of that neural net and you write down a function f that basically expresses how well this judge is distinguishing between the hallucinating photos and the real photos okay let me show you there are many examples for what function to choose there but let me give you one function that you know I like that gives us to what is called a Vassar sting gun okay so that function basically runs a test okay so it uses the discriminator this guy it computes the expectation of the discriminated on a real image versus a hallucinating image so now this hallucinating image is basically you sample garbage you plug it into the generator and you get this hallucinate image and you run the test on that image and you compare this to expectations now if the discriminator could implement every mixes one function this soup of that function would be the vastest in distance between the two distributions okay but because I cannot range the discriminator over all possible lips is one functions I'm restricting myself okay so I'm giving him some expressive power but not all expressive power so he finds a different type of the soup is a different type of metric but it's you know closely related to the bus resting distance now this parameters theta and w that the two players have to set are very highdimensional there are a million dimensional okay or you know they're very high dimensional so you cannot solve this minimax problem you know centrally okay so you have to solve it in a very naive manner and what makes sense to do is to run a first order method so usually the way these things are trained is by having the inch player runs some version of gradient descent and the soup layer and some version of gradient ascent okay in tandem okay so you have the two players run a first order method only requires computing gradients you don't have to don't compute Hessians or anything and you hope that this will converge to the minimax equilibrium now the major issues that occur and I'm going to talk a little bit about one of the issues is you actually get cycling very calm if you run gradient descent ascent to solve the minimax problem it's unfortunately not stable okay so minimization problems running claim descent is you know you get your luck eloped but if you run gradient descent ascent for a minute max problem you actually cycle around the solution related to that the distribution that we're talking about here are very high dimensional a thousand you know ten thousand a million dimensional these are distribution of our images so you compute you know you use a bad training procedure okay to set the parameters of your of your gun you know like good luck you know that you get no guarantees so at least you'd like to evaluate the quality of your algorithm and you know a big challenge is you know getting rigorous statistical guarantees about what happened after training is over or you know it for statistical guarantees during training so recently I've been working on these two issues and I'm going to briefly talk about one of them about how that the optimization point of view how could you possibly remove cycling I'm going to give you some pointers to a video where I discuss also the second unit challenge alright so first of all let me show you
38:27
what happens you know what cycling is and how it comes about in training these things so here suppose that the training distribution is just a mixture of gaussians array whose lanes are arranged on a cycle okay suppose this is my training data a mixture of gaussians and what I'm showing in the bottom here a row is what distribution the generator generates are different steps of training and what you observe is that if I look at the distribution of the generator generates a different steps of training the generator is basically sampling different modes of the Gaussian distribution but it never samples from the mixture so you know like the sort of like the you know the discriminator is hunting the generator who's switching modes of the disc but in this example it never actually converges to the mixture in a more
39:21
interesting example so here I'm training in the mixture of hundred and digits ok so 10 you know have 10 distributions over 100 and digits and I'm training my generator discriminator on the mixture of those handwritten digits and again I'm showing you here what happens at different steps of training I'm not focusing on any particular node but I'm you know focusing on what the authors call pirata digits so you see like if I stop at different steps of training my distribution has you know this mod collapse property okay so this you know this training of solutions are very common here and again I emphasize that this is unlike minimization problems so gradient descent is the method reserved for training deep neural net classifiers and if you have a minimization problem you don't really descend ok you're not gonna get a global of but you will get to a LOC a lot on the other hand when you use gradient descent ascent in
40:19
tandem to solve a minimax problem all bets are off so you get cycling even if your function f is convex concave which is the fundament case which is the best case scenario for min max problems so we have been thinking about how to remove these oscillations and so we have some results interestingly so what we show is that if instead of doing gradient descent ascent you do graduate the center sent with with negative momentum and you know I'll explain what that means you you you know and if your function is convex ok so that's the best case scenario you will convert the minimax equilibrium even even if you have becomes if you're solving a constraint minimax problems where your parameters live in you know to given polytopes and so in particular even if that's us as basically a strong as doing linear programming yeah except now you have to do projected gradient descent because gradients tells me take you outside of your politics you have to project back but other than that if you do a claimed gradient descent with negative momentum it's actually good for you so the surprising thing is that usually in minimization problems positive momentum is good for you because it smooths out but possibly rugged terrain where whose gradients may be pushing you in all sorts of different directions if you maintain momentum you follow a smoother path to the local opt as it turns out for minimax problems negative momentum is good for you so that was the one of the surprising aspects here and you know like you know
42:06
I'm close to being over but let me try to offer you some cartoon I haven't given I guess any proof in this talk but let me give you one cartoon at least proof so I'm going to look at the most trivial possible case of a minimax problem so you have scaler theta and W and you have a linear function theta times W basically the only reason I'm centering it is that my pictures were centered at 0.5 s but other than that this theta times W so if you want to do gradient descent and ascent for the min and the max players of this game it boils down to these equations ok so the theta guy is doing gradient descents in particular at time step T plus 1 he updates his car and theta by subtracting the gradient with respect to theta of this function and this guy the su player does get a sense he adds the gradient to his current value ok the gradient with respect to his own solid you now if you like so this function the unique minimax equilibrium is the point five point five point okay in any other point you know the other player can exploit you so you have both players have to play 0.5 to be our equilibrium nevertheless if I do gradient descent and starting say from here I will spiral out of the to infinity okay in this unconstrained problem so great in December Sun has the spiraling out behavior usually in discrete time let me explain and that's my car to improve why negative momentum could be good for you and you can actually make that into math first what is negative momentum negative momentum is instead of following those dynamics here that I'm showing up there I'm actually undoing some of yesterday's gradient in my current step okay so you know this would be normal gradient descent but I'm doing a little bit that's one step size a little bit half the step size of my yesterday gradient and for the ascent player I'm also undoing some of the yesterday gradient which basically if would this would be the trajectory of normal gradient descent it would mean that basically instead of following the gradient step here I will undo a little bit of yesterday's gradient which is this one this is gonna push me slightly inside which you know like if you do the math will get you a nice trajectory converging to the minimax equilibrium this is roughly speaking Y negative momentum is good for you right so like if you if you're going around the star and he's spiraling out if I give you some negative momentum I can push you to convert to the star okay which is in the middle that's kind of the intuition okay
44:57
but in that leads me to a very interesting approach this is what happens if your function is not convex concave now we have no idea what's happening here okay so that's a big a very interesting open problem getting a first order method some variant of gradient descent that is guaranteed to get to a local minima solution for a minimax problem okay again it was just a minimization problem gradient descent will get you to a local opt but because the you know even even if the function isn't concave but here in a mini marts problem no method no firstorder method variant of gradient descent that can get you to minimax solution so that's a very interesting in my opinion open problem so I would like to conclude I mean so we
45:53
can still apply this method despite not having properly no proofs of convergence in practice and we actually do get improvement in practical settings but I would like to conclude so what I've talked about today I guess I gave an overview starting from from Norman to now of the uses of game theory in you know game theory in machine learning so what I told you about is that Nash equilibria are people complete and in particular are equivalent to a bunch of other problems in game theory economics combinatorial topology and when Nash equilibria are intractable you should not trust their predictions okay chances are players are not going to be able to get there and chances are you are also not going to be able to find the equilibrium ok so it's important to identify families of games where a Calibri are tractable or you know understand you know how to do game theory using alternative solution concepts and emerging on the complexity side I'm going to state now but of open problems with a find very interesting so there's an emerging literature trying to show other its case hardness of PPG in other related classes in between and P and P N and P one very interesting open problem and there are reasons to believe this should be true is showing that factoring this problem that I showed you earlier that was floating around between P and P showing that is actually in PPD okay factoring actually ends up being included in two cousins of the class PPD whose natural in is you know class in the intersection of these two classes is PPD so it would be really interesting to show that factoring is there on the flip side of that so that would be sort of like using cryptographic primitives to argue about the average case come laxity of PPD on the flip side of that is assuming an average case complexity of PPD try to show try to develop cryptographic methods okay so the graphic methods should be founded on the complexity on complex on average case complexity assumptions and you know it would be very compelling if we could base cryptography on the average case hardness of you know PPD and one of the related classes on the game theory and deep learning style I told you that you know there's you know a lot of excitement in building games that big humans and also in creating these you know guns this generated other serial networks that you know learn intricate distributions of our pictures and I stayed a bunch of open problems there so one is understand the optimization landscape of minimax problems we have know between you know we don't know how to you know we don't have first order methods that get there and also to develop statistical guarantees for games as I told you you know the distributions that we're after are high dimensional so you know we would like to get statistical guarantees about the properties of our you know generators on the sort of like you know challenge if you one side you know like there has been a lot of excitement about beating human players in chess and go but you know and you know twoplayer Texas Hold'em but and this would be consistent with my theory you know I've I think that you know like moving outside of the class of you know nicely behaved two player zerosum games is a whole different game okay it's a whole different ballpark and you know as a challenge to the community I would be super interesting interested if somebody could develop good you know three players you know Texas Hold'em or you know like you know deep learning Bay's or you know other methods that bit humans in three player games okay and you know here are some further pointers that I promised if you want to hear more about you know both optimization in the statistical if on guns you can look at you know a recent talk that I gave at MIT on the complexity Libya is a video where I expand on what PPD is and the developments and the complexity front there is a video from Simon's Institute of theoretical computer science and you can also look at my ICM proceedings article for you know more details and little and you know references and a lot of open problems around is in this area thank you very much thank you very much for a great talk we are running late but we have the possibility of questions [Applause] yeah okay yeah right okay good so right so you know as you said positive momentum is considered to be good for you why because it's smooth smooth and it's a racket domain so I thought the classical image coming down a mountain and following gradients you know if it's rugged then you know maybe you know one day gradient pushes you this way the next day that way and so so far so you'd be very slow going back and forth while getting momentum would get you down you know down the slope faster right so this method but you know like what I showed you you understood the hope why negative positive men would be bad for you right so you're already going around this you know a cycle or a spiral with positive momentum you're gonna get into a bigger spiral and negative is the opposite of that so intuitively it makes sense now if you're sort of like if if if if your function is strongly convex concave you can show that you don't need to do negative momentum that great you know just great like if you strongly if you strongly convex then you know gradient will take you to the bottom right and then you know I don't know like optimizing with respect to you know it sounds to me that you know negative momentum would be always good for you but you know now on the optimization front there have been cases where actually negative momentum is good for you and that arises when you're doing not any type of great descent but if you bring stochastic gradient descent so if you're you know training a big neural network you know you cannot take gradients with respect to the whole function but you you know you look at you know functions that are separable so it's a sum of a bunch of functions and usually you sample small sample of them and you do gradient only with respect to that sub sample the objective function and you know then you know there is some bias like you know like there is you know an expectation you get the right gradient but there are some biases that sometimes negative momentum can correct so it has been argued for different reasons that in SGD a negative momentum could actually help you it was another question over there so now you can try to throw it
53:55
it's good yes so thank you very much for this talk and quite nice to see these things connected I have anyway a simple yes/no question so is it the case that the a good metaphor for this face generation random face generation is that you sort of think of it as taking the faces as inputs reverseengineer their DNA mutates the DNA and then regrow the person and render that face yeah yes and no I would say yes it's a it's a you know it's a it's a nuance yes yeah so we're trying to do essentially what imagination does or what genes do right so genes you know like map random random genome to a face okay but the random genome should be you know should be valid genome okay but as long as it's a valid genome it will give rise to a face and when you have we know sexual introduction or mutations you know like you know you inherit random face related to your parents and so forth right so but yeah this process or what your imagination does when you want to imagine you close your eyes you want to imagine the random celebrity or you know some other face this what you do you map random bits from your century you know inputs and you know somehow you use those to hallucinate random face so this is what we try to replicate here yeah thank you okay I think we will have to stop here cause this one's more of a great [Applause] [Music] you
00:00
Internet forum
Neumann boundary condition
Musical ensemble
Scheduling (computing)
Computational intelligence
Computer network
Point (geometry)
Game theory
Computational intelligence
Computational complexity theory
Computational complexity theory
01:09
Logical constant
Axiom of choice
Musical ensemble
State of matter
Direction (geometry)
Equaliser (mathematics)
Thermodynamic equilibrium
Parameter (computer programming)
Disk readandwrite head
Expected value
Maxima and minima
Duality (mathematics)
Sign (mathematics)
Machine learning
Video game
Strategy game
Computer network
Convex set
Matrix (mathematics)
Aerodynamics
Position operator
Normalform game
Optimization problem
Point (geometry)
Instance (computer science)
Thermodynamic equilibrium
Functional (mathematics)
Category of being
Neumann boundary condition
Arithmetic mean
Hexagon
Network topology
Graph coloring
Theorem
Right angle
Whiteboard
Row (database)
Open source
Distribution (mathematics)
Dynamical system
Control flow
Inequality (mathematics)
Event horizon
Simplex algorithm
Linear programming
Uniqueness quantification
Theorem
Game theory
Mathematical optimization
Linear map
Window
Compact space
Interactive television
Set (mathematics)
Cartesian coordinate system
Quadratic form
System call
Computational complexity theory
Berechnungskomplexität
Continuous function
Convex set
Interpreter (computing)
Game theory
Mathematical optimization
Family
09:30
Computer chess
Beat (acoustics)
Parameter (computer programming)
Computational intelligence
Event horizon
Computational complexity theory
Video game
Strategy game
Information
Data structure
Position operator
Form (programming)
God
Algorithm
Information
Expert system
Computer
Instance (computer science)
Thermodynamic equilibrium
Category of being
Computational intelligence
Smart card
Theorem
Complex system
Game theory
Arithmetic progression
11:53
Ring (mathematics)
Thermodynamic equilibrium
Prediction
Thermodynamic equilibrium
Plot (narrative)
Prediction
Video game
Strategy game
Integrated development environment
System programming
Negative number
Game theory
Complex system
Game theory
Multiplication
Social class
12:45
Group action
Randomization
Boolean algebra
1 (number)
Thermodynamic equilibrium
Matching (graph theory)
Parameter (computer programming)
Mereology
Thomas Kuhn
Duality (mathematics)
Strategy game
Internet forum
Aerodynamics
Boolean satisfiability problem
Social class
Area
Algorithm
Programming paradigm
Polynomial
Product (category theory)
Constraint (mathematics)
Theory of relativity
Software developer
Electronic mailing list
Parallel port
Bit
Measurement
Functional (mathematics)
Thermodynamic equilibrium
Proof theory
Category of being
Neumann boundary condition
Wellformed formula
Bipartite graph
Telecommunication
Duality (mathematics)
Simplex algorithm
Computer science
Theorem
Code refactoring
Quicksort
Complexity class
Polynomial
Finitismus
Existence
Vapor barrier
Divisor
Algorithm
Dynamical system
Infinity
Equivalence relation
Number
Wellformed formula
Utility software
Linear programming
Theorem
Software testing
Maize
Data structure
Game theory
Mathematical optimization
Linear map
Prime factor
Set (mathematics)
Existence
Prime number
Computational complexity theory
Berechnungskomplexität
Bargaining problem
Integrated development environment
Function (mathematics)
Strategy game
Dependent and independent variables
Complex system
Game theory
Mathematical optimization
Integer
Capability Maturity Model
19:03
Point (geometry)
Complexity class
Polynomial
Graphics tablet
Vapor barrier
Lemma (mathematics)
Thermodynamic equilibrium
Matching (graph theory)
Web browser
Complete metric space
Computational intelligence
Equivalence relation
Computational complexity theory
Linear programming
Convex set
Reduction of order
Theorem
Continuous function
Game theory
Social class
Compact space
Algorithm
Point (geometry)
Lemma (mathematics)
Polygon
Instance (computer science)
Thermodynamic equilibrium
Functional (mathematics)
Equivalence relation
Prime number
Computational complexity theory
Pointer (computer programming)
Bargaining problem
Network topology
Convex set
Bipartite graph
Code refactoring
Combinatorics
Complex system
Game theory
Mathematical optimization
22:42
Randomization
Scientific modelling
1 (number)
Thermodynamic equilibrium
Function (mathematics)
Parameter (computer programming)
Open set
Computer simulation
Mereology
Medical imaging
Machine learning
Computer network
Multiplication
Descriptive statistics
Software developer
Sampling (statistics)
Computer simulation
Parameter (computer programming)
Prediction
Thermodynamic equilibrium
Functional (mathematics)
Category of being
Arithmetic mean
Digital photography
Process (computing)
Sample (statistics)
Theorem
output
Freeware
Data type
Data structure
Resultant
Statistics
Probability theory
Computergenerated imagery
Distribution (mathematics)
Artificial neural network
Virtual machine
Control flow
Dynamical system
Power (physics)
Twitter
Goodness of fit
Population density
Hardy space
Data structure
Gamma function
output
Subtraction
Game theory
Distribution (mathematics)
Task (computing)
Computer architecture
Differentiable function
Artificial neural network
Interactive television
Variance
Set (mathematics)
Berechnungskomplexität
Similarity (geometry)
Population density
Word
Bargaining problem
Function (mathematics)
Universe (mathematics)
Nonparametric statistics
Complex system
Game theory
Distribution (mathematics)
Family
29:15
Randomization
Focus (optics)
Electronic data interchange
Touchscreen
Real number
Scientific modelling
Computergenerated imagery
Weight
Wave packet
Sample (statistics)
Objectoriented programming
Term (mathematics)
Function (mathematics)
Database
Videoconferencing
output
Right angle
Distribution (mathematics)
Resultant
30:30
Digital photography
Objectoriented programming
Cellular automaton
Database
Computergenerated imagery
Source code
Set (mathematics)
31:14
Randomization
Real number
Computergenerated imagery
Sampling (statistics)
Bit
Function (mathematics)
Wave packet
Database
Source code
output
Fuzzy logic
Boundary value problem
Plugin (computing)
31:59
Pixel
Spacetime
Geometry
Sampling (statistics)
Weight
Event horizon
Euklidische Geometrie
Geodesic
Electronic signature
Medical imaging
Goodness of fit
Process (computing)
Symmetry (physics)
Term (mathematics)
Right angle
33:22
Gradient
View (database)
Range (statistics)
Function (mathematics)
Parameter (computer programming)
Computational intelligence
Weight
Expected value
Maxima and minima
Medical imaging
Videoconferencing
Bus (computing)
Statistics
Algorithm
Real number
Gradient
Sampling (statistics)
Bit
Functional (mathematics)
Thermodynamic equilibrium
Inclined plane
Metric tensor
Digital photography
Process (computing)
output
Lipschitz continuity
Cycle (graph theory)
Procedural programming
Data type
Point (geometry)
Statistics
Real number
Computergenerated imagery
Distance
2 (number)
Wave packet
Power (physics)
Revision control
Software testing
Game theory
Gradient descent
Distribution (mathematics)
Units of measurement
Mathematical optimization
Artificial neural network
Distribution (mathematics)
Firstorder logic
Wave packet
Pointer (computer programming)
Hessian matrix
Noise
Game theory
Gradient descent
38:25
Maxima and minima
Asynchronous Transfer Mode
Greatest element
Numerical digit
Computergenerated imagery
Programmable readonly memory
Distribution (mathematics)
Weight
Wave packet
Oscillation
Mixture model
Case modding
Authorization
Subtraction
Distribution (mathematics)
Distribution (mathematics)
Sampling (statistics)
Category of being
Circle
Function (mathematics)
Mixture model
Vertex (graph theory)
Cycle (graph theory)
Digitizing
Normal distribution
Asynchronous Transfer Mode
Row (database)
Gradient descent
40:18
Point (geometry)
Maxima and minima
Momentum
Gradient
Direction (geometry)
Dynamical system
Thermodynamic equilibrium
Parameter (computer programming)
Oscillation
Maxima and minima
Polytop
Mathematics
Linear programming
Convex set
Negative number
Subtraction
Gradient descent
Position operator
Constraint (mathematics)
Uniqueness quantification
Gradient
Trajectory
Bit
Functional (mathematics)
Thermodynamic equilibrium
Polytop
Stokes' theorem
Proof theory
Linearization
Equation
Normal (geometry)
Momentum
Quicksort
Game theory
Thetafunktion
Resultant
Fundamental theorem of algebra
Gradient descent
44:57
Computer chess
NPhard
Building
State of matter
Gradient
Model theory
Primitive (album)
Thermodynamic equilibrium
Open set
Computational complexity theory
Expected value
Maxima and minima
Medical imaging
Stochastic
Pointer (computer programming)
Cryptography
Convex set
Videoconferencing
Statistics
Pairwise comparison
Position operator
Social class
Family
Area
NPhard
Theory of relativity
Electric generator
Smoothing
Software developer
Gradient
Sampling (statistics)
Parameter (computer programming)
Prediction
Term (mathematics)
Functional (mathematics)
Thermodynamic equilibrium
Open set
Proof theory
Category of being
Exterior algebra
Network topology
Right angle
Cycle (graph theory)
Data type
Alpha (investment)
Thomas Bayes
Combinatorics
Maxima and minima
Domain name
Topology
Statistics
Graphics tablet
Identifiability
Momentum
Divisor
Algorithm
Parametrische Erregung
Distribution (mathematics)
Virtual machine
Spiral
Average
Equivalence relation
Hypercube
Oscillation
Social class
Polytop
Implementation
Best, worst and average case
Subtraction
Game theory
Combinatorics
Mathematical optimization
Metre
Artificial neural network
Distribution (mathematics)
Firstorder logic
Archaeological field survey
State of matter
Set (mathematics)
Cryptography
Theoretical computer science
Computational complexity theory
Stokes' theorem
Summation
Pointer (computer programming)
Bargaining problem
Function (mathematics)
Complex system
Momentum
Negative number
Game theory
Mathematical optimization
Family
Gradient descent
53:48
Internet forum
Musical ensemble
Randomization
Electric generator
Process (computing)
Inheritance (objectoriented programming)
Mapping
Volumenvisualisierung
output
Bit
Quicksort