Approximating Values of Generalized-Reachability Stochastic Games
Formal Metadata

Title: Approximating Values of Generalized-Reachability Stochastic Games
Number of parts in series: 56
License: CC Attribution 3.0 Germany: You may use, modify, and reproduce, distribute, and make the work or its content publicly available in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers: 10.5446/49286 (DOI)
Language: English
Transcript: English (automatically generated)
00:11
Welcome to our presentation of the paper Approximating Values of Generalized Reachability Stochastic Games. We start by giving the complete high-level picture in the first five minutes.
00:24
We are considering the model of stochastic games with generalized reachability objective. In other words, we have multiple sets of target states and we want to reach all of them with the highest possible probability. However, our opponent tries to minimize our chance of doing so.
00:43
For example, we might want to maximize the probability of reaching target set T1 while at the same time maximizing the chance of reaching T2. However, reaching both of them surely at the same time is not possible because the goals are partially conflicting.
01:00
To capture this trade-off, one can construct the so-called Pareto frontier, which shows all possible Pareto optimal outcomes. In this example, if we focus only on this objective T1, we can satisfy it with probability 1. If we focus only on T2, we can achieve a probability of 0.7.
01:23
If we weight both objectives equally, this corresponds to the 45° direction in the picture, and the green X marks the best probabilities we can achieve. Every point on the frontier is Pareto optimal, so no point dominates another; all these points are incomparable.
01:42
Additionally, every point on the frontier corresponds to a strategy that allows us to achieve it. We want to compute such a Pareto frontier because it gives us the best information to decide our strategy. Previous work has shown how to approximate it from below using value iteration.
02:02
Value iteration from below works as follows. We initially estimate we can reach nothing and then iteratively compute an ever more precise approximation of the frontier that converges to the true one in the limit. However, there was no stopping criterion for general games and we had no way of knowing how close the approximation is.
02:26
Our contribution is a way to compute a convergent over-approximation as well. Starting with the estimate that we can reach everything surely, we iteratively decrease the upper bound. Thus, we know how close our approximation is and we can give an anytime algorithm.
02:46
That is, an algorithm that can be stopped at any time returning our current estimates as well as their precision. If we are given a precision epsilon, we can also run the algorithm until the distance between the under and over approximation is at most epsilon in all directions.
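As a very rough illustration, the overall anytime scheme could look like the following Python sketch. The helpers `initial_lower`, `initial_upper`, `update_lower`, `update_upper`, and `distance` are hypothetical placeholders for the under-/over-approximation updates and the largest gap between the two frontiers, not the paper's actual interface.

```python
def anytime_approximation(game, epsilon):
    # Under-approximation: start from "we can reach nothing".
    lower = initial_lower(game)
    # Over-approximation: start from "we can reach everything surely".
    upper = initial_upper(game)
    # Refine both bounds; stopping at any time yields sound bounds with known precision.
    while distance(lower, upper) > epsilon:
        lower = update_lower(game, lower)   # value iteration from below
        upper = update_upper(game, upper)   # the new convergent update from above
    return lower, upper
```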
03:05
And thus, we can approximate the values of generalized reachability stochastic games for arbitrarily small precision. The biggest problem we had to solve was the following. The over-approximation need not converge to the frontier.
03:22
Instead, it might stay at some greater fixed point. To overcome this problem, we had to consider not the whole frontier at once, but certain individual directions. There, we could apply the solution for the single-dimensional case. We needed some geometrical argument to identify regions of the frontier so that by
03:45
solving finitely many single-objective reachability problems, we were able to piece together the whole frontier. So this picture summarizes our contribution. However, to see the big picture, I have to tell you about the context.
04:02
This table gives the computational and strategy complexity of some related models. For stochastic games with a single target, the problem is in NP ∩ coNP. For multidimensional MDPs, that is, games with only a single player, it is PSPACE-complete.
04:22
With multiple dimensions and two players, we do not even know whether the corresponding decision problem is decidable. In terms of strategy complexity, for single-dimensional games, memoryless deterministic strategies suffice. For multidimensional MDPs, we need either randomization, in the easier case with absorbing targets, or finite memory in general MDPs.
04:49
For multidimensional games, however, we need infinite memory, even if the targets are only absorbing. Let me emphasize one more thing. In our setting, it is highly relevant whether we fix our strategy first or our opponent does.
05:06
In MDPs and single-dimensional games, this distinction is not relevant. We consider the worst case, so we fix first and our opponent can react; this case is theoretically harder. So the problem we consider is fundamentally difficult, and proving that the frontier can be approximated to arbitrary precision is a big step.
05:29
As we said before, the key difficulty is that the over-approximation need not converge to the true frontier, as there can be multiple fixed points above it, where value iteration gets stuck.
05:42
To understand why this happens, let's first recall how value iteration works, in particular in the multidimensional setting. Here, we have states of the minimizer, marked by a circle, and of maximizer, marked by a box. The arrows are our actions, and the little black dots show that probability comes into play.
06:06
For example, from this state, with a certain probability, say ½, we reach a state in target set T2, and with the remaining probability, we reach a state that is both in T1 and T2. Hence, the Pareto frontier looks like this. The probability for T2 is 1, as it is certainly reached, and the probability for T1 is ½.
06:32
By a similar argument, we get this Pareto frontier down here. The minimizer state here now has the choice between these two frontiers.
06:41
Now we have to ask, what is the worst thing minimizer could do to us? If we care about T1, minimizer would go up, giving us only ½. Dually, if we care about T2, minimizer would go down, giving us only ½. More generally, a minimizer state takes the intersection of all possible actions.
07:04
On the other hand, we as maximizer take the union, as we always pick the action that yields the highest probability. Moreover, we can even mix the actions, take any convex combination, if it yields a better probability.
07:21
For example, taking both actions with ½ yields this point. What you just saw is the extension of the so-called Bellman update to multiple dimensions, and it is the basic operation of value iteration. By backpropagating these frontiers, we were able to figure out what we can achieve where.
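As a rough sketch of this generalized Bellman update, one could represent each frontier by a finite set of achievable probability vectors (implicitly downward closed). The code below is only illustrative: a faithful implementation would take the downward closure of the convex hull for the maximizer, whereas here only pairwise midpoints are added as example mixtures.

```python
import itertools
import numpy as np

def maximizer_update(action_frontiers):
    # Union of the actions' achievable points, plus some convex mixtures
    # (only pairwise midpoints here; the real update takes the full convex hull).
    points = [tuple(p) for frontier in action_frontiers for p in frontier]
    mixes = [tuple(0.5 * (np.array(p) + np.array(q)))
             for p, q in itertools.combinations(points, 2)]
    return points + mixes

def minimizer_update(action_frontiers):
    # Intersection of the downward-closed sets: a vector is achievable only if
    # every action dominates it, so take component-wise minima over one point
    # per action.
    return [tuple(np.min(np.array(combo), axis=0))
            for combo in itertools.product(*action_frontiers)]
```

For instance, intersecting the downward closures generated by (½, 1) and (1, ½) yields the single generator (½, ½), matching the point the minimizer leaves us in the example.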
07:43
So, what is the problem? Cycles. If we have any kind of cycles, things become more complicated. Let's first look at probabilistic cycles. For this, we introduce this self-loop here. So now, we could either go to one of these target states, or we could stay in this state.
08:05
Let's zoom in to look at only this action. And now, we will see why value iteration converges only in the limit. We start by estimating that our maximizer state here can reach nothing. If we do the Bellman update once, we see that our estimate has the same shape
08:22
as the true frontier, but there is the probability that leads back to the state itself and yields nothing. Iterating again, we know that going back yields something, so our estimate becomes bigger. Doing this again and again, we get ever more precise, but for any finite number of iterations, there is a positive probability to always use this self-loop and stay here.
08:47
So this is why we converge only in the limit. We can do the same thing for the upper bound, starting by saying that we can reach all targets with probability one, and then reducing our estimate by the Bellman update.
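For intuition, here is a toy one-dimensional version of this picture, assuming for concreteness a self-loop probability of ½ and a single target reached with the remaining ½:

```python
# One state, one action: with probability 1/2 reach the target (value 1),
# with probability 1/2 loop back to the state itself.
p_exit, p_loop = 0.5, 0.5

lower, upper = 0.0, 1.0          # "reach nothing" vs. "reach everything surely"
for i in range(10):
    lower = p_exit * 1.0 + p_loop * lower   # Bellman update from below
    upper = p_exit * 1.0 + p_loop * upper   # Bellman update from above
    print(i, round(lower, 4), upper)
# lower climbs 0.5, 0.75, 0.875, ... and reaches 1 only in the limit;
# upper already sits at the true value 1 and stays there.
```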
09:03
You see that in this example, both approximations converge to the frontier. However, this is not the case if we have cycles that can occur surely, so-called end components. These are the core of the problem. An end component is a set of states where the play could stay forever.
09:24
For example, if minimizer goes up here and we always go down, the play will stay in these two states forever. So, because it is possible to stay in these two states forever, they form an end component. Similarly, these two states form an end component, and actually also all three
09:44
states form an end component, because maximizer could always stay and minimizer could randomize. These end components are the reason that value iteration does not converge from above. This is a well-known phenomenon, also from the single-dimensional case and even if there is only a single player.
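For reference, an end component is a set of states together with a choice of actions such that every chosen action stays inside the set and the induced graph is strongly connected. The following sketch checks this condition for a candidate set, assuming a hypothetical transition function `delta[s][a]` that maps to a successor-to-probability dictionary:

```python
def is_end_component(states, actions, delta):
    # states: a set of states; actions[s]: the chosen actions of s (non-empty);
    # delta[s][a]: dict mapping successors to probabilities (hypothetical format).
    # Condition 1: every chosen action stays inside the set with probability 1.
    for s in states:
        for a in actions[s]:
            if any(t not in states for t in delta[s][a]):
                return False
    # Condition 2: the induced graph is strongly connected
    # (every state reaches every other state via the chosen actions).
    def reachable(src):
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            for a in actions[u]:
                for v in delta[u][a]:
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
        return seen
    return all(states <= reachable(s) for s in states)
```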
10:03
To keep things digestible, let's add the complications one by one. So, let's consider a single-dimensional objective, where these values on the arrows denote the true values, the true probabilities to reach the target. For the moment, we assume we already know them.
10:23
Then, if we iterate from above and start at 1, minimizer cannot decrease the estimate here, as all the successors also have the estimate of 1. We as maximizer see the 1 and we prefer it to the exits that are available to us, I mean 1 is the higher number.
10:44
What we do not realize is that staying in here, we will actually never reach the target. The solution for the single-dimensional games is to realize that we are making this error and tell all states in the end component that they must not look at each other, but they have to look at their exits.
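To see the spurious fixed point in isolation, consider a stripped-down gadget (a simplification of the example, not its exact shape): two states that can reach each other, plus an exit with true value 0.7.

```python
# Two states s and t form an end component; s can also exit with true value 0.7.
upper = {"s": 1.0, "t": 1.0}          # start the over-approximation at 1
for _ in range(100):
    upper["s"] = max(upper["t"], 0.7)  # s prefers t's (spurious) estimate of 1
    upper["t"] = upper["s"]            # t only looks back at s
# upper never moves below {"s": 1.0, "t": 1.0}: a greater fixed point above the
# true value 0.7. Deflating the end component to its best exit breaks this.
```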
11:04
Let's decrease the values of all these states to that of the best exit, so in this case to the 0.7. Alright, so this one is correct now, but these two are actually still too high. To see why, we need some more in-depth analysis.
11:22
We need to realize that the minimizer state here will always go down and that these two states cannot depend on the top exit here. In the words of the paper, these two states form a simple end component and we have to deflate the estimate of all states in it to the value of the bottom exit.
11:44
And now we get the correct values. So let's take a step back, what have we seen? In cycles, so-called end components, the over-approximation might not converge, as the states depend on each other, not on the exits.
12:03
So we want to deflate those states, decreasing their values according to the exits they can reach. In order to do so, we first have to find the so-called simple end components. They tell us which states can reach which exit. And to find simple end components, we have to look at the choices of minimizer.
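In the single-dimensional case, the deflation step itself is simple; here is a minimal sketch, with `upper` as a state-to-estimate map and `exit_values` as the current estimates of the SEC's exits (both hypothetical names):

```python
def deflate(upper, sec_states, exit_values):
    # States inside a simple end component (SEC) can achieve at most what the
    # best exit of the SEC achieves, so cap their over-approximation there.
    best_exit = max(exit_values)
    for s in sec_states:
        upper[s] = min(upper[s], best_exit)
    return upper
```

In the multidimensional algorithm, the same capping happens per region, with frontiers in place of single numbers.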
12:25
Notice that we fixed minimizer's choice depending on the values on these arrows here. And earlier we said, we assume we know the true values. But actually we want to compute the true values, so we can't do that. In the algorithm, we guess minimizer's choice according to the current estimates that we have, but those estimates might still change.
12:48
Assume that after one more iteration, we figure out that the upper exit can actually only achieve 0.4. Now minimizer prefers the upper action, and then these two states form a simple end component and we have to deflate accordingly to 0.4.
13:05
In the single-dimensional case, this was no big deal, as deflating is always sound. We can change our decision as often as we like, as long as eventually we decide correctly. But in the multi-objective setting, we can have the situation that for one objective these two
13:23
form a simple end component, and for the other objective these two form a simple end component. Since we are talking about multiple dimensions now, let's replace the numbers with Pareto frontiers. And we use the frontiers that you saw before. At the upper exit, we can reach T1 surely, but T2 only with one half, and at the bottom exit vice versa.
13:47
If we focus on T1, minimizer wants to go down to restrict the value to one half. Then we get the red simple end component, and we can actually deflate and say that in
14:00
the direction of T1, these two states get one half, and this one here can achieve 1. Dually, if we focus on T2, minimizer prefers to go up, and we have the blue simple end component. The states in the blue simple end component get only one half in this direction, and the bottom state can achieve 1.
14:23
Okay, so actually for these two directions, we just figured out what happens. We can actually deflate them as in the single dimensional case. But there are uncountably many directions between them. What do we do with those?
14:40
Well, intuitively, we see that for this whole part, for all the directions in the blue region, minimizer will choose to go up in order to restrict the value to one half. And thus, minimizer forms the blue simple end component. And dually for the red region, minimizer wants to go down and form the red simple end component, restricting the value in these directions to one half.
15:04
And actually, for the dashed line at the 45 degree angle, minimizer is indifferent, and all three states form a simple end component. Okay, now we have some information about every direction, and we did this by not only talking about single directions, but about whole regions.
15:24
We can piece together the frontiers region by region, and we get this picture. For the blue region, the states in the blue simple end component get the frontier of the top exit, while the bottom maximizer state wants to use its own exit.
15:40
Dually, for the red region, the red simple end component uses the bottom exit, while the upper state wants to use its own exit. And for the dashed 45 degree line, all states agree on the point one half, one half. Okay, so this worked on an example, but how does it work in general?
16:02
To answer this, notice that to find the simple end components, we had to ask, what will minimizer choose? And inside the regions, minimizer always chooses the same action. So formally, a region is a set of directions where minimizer's preference over the actions is the same.
16:23
Okay, and when does minimizer's preference change? At the places where the frontiers of the exits intersect. In this case, the frontiers intersect at the 45 degree direction, and this is where the preferences change. Above it, minimizer prefers going up. Below it, minimizer prefers going down.
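In this two-dimensional example, the region boundary can be computed directly: the minimizer's preference flips exactly where the weighted values of the two exits coincide. A small sketch, using the exit vectors from the example (the helper name is made up):

```python
exit_up   = (1.0, 0.5)   # reaches T1 surely, T2 only with one half
exit_down = (0.5, 1.0)   # vice versa

def indifference_weight(a, b):
    # Weight w on T1 (and 1 - w on T2) at which both exits have the same
    # weighted value, i.e. w*a[0] + (1-w)*a[1] == w*b[0] + (1-w)*b[1].
    return (b[1] - a[1]) / ((a[0] - b[0]) + (b[1] - a[1]))

w = indifference_weight(exit_up, exit_down)   # 0.5, the 45-degree direction
# For weights above 0.5 (focusing on T1) the minimizer prefers exit_down,
# below 0.5 (focusing on T2) it prefers exit_up: these are the two regions.
```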
16:44
In general, that is, in higher dimensions, the basic notion is the same. We take the intersection of the frontiers and look for regions between the points of intersection. Take this three-dimensional example. One exit is this plane, which can reach each single target surely, but only one of them at a time.
17:05
And then it can randomize between the possibilities. For example, here in the middle, it gets one-third for every target. The other exit is this box. It reaches all targets with probability 0.7.
17:21
Intersecting the two frontiers gives us this picture. Intuitively, here at the corners, the blue exit is better, as it can reach a single target surely. But here in the middle, the red exit is better, as it can reach 0.7 for every target, while the blue exit can only randomize and reach one-third.
17:42
Note that I just said better. This means better for us maximizing people. Minimizer, of course, has the opposite objective. So, in these three regions where we see the blue frontier, minimizer plays the red exit, and here in the middle, minimizer plays the blue exit. So, again, we have decomposed the picture into regions where we know what minimizer will do.
18:07
And this is how far you get with pictures and intuitive explanations. In higher dimensions, you need some geometric notions like projective hyperplanes, triangulations, and simplicial complexes to describe things properly, but everything generalizes nicely.
18:24
So, let's go through all the things you heard about once more, so that you can put everything into context. We want to compute an under- and over-approximation of the Pareto frontier. We initialize the under-approximation L as the point with zeros in every dimension, and the over-approximation U as the hypercube with ones in every dimension.
18:46
Then, we do value iteration until these approximations are epsilon close to each other. Value iteration means, do the generalized Bellman update for L and for U, and then do the new things in order to make the over-approximation converge even for the evil end components.
19:06
The new things are, first, find the regions by computing the intersection of all exits. Then, for each region, do the same thing as in the single-dimensional case. Find the simple end component, deflate to the best exit, and then finally, piece together the whole frontier.
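Putting this together, the overall loop might be sketched as follows. Every helper here (`zero_point`, `unit_hypercube`, `epsilon_close`, `bellman_update`, `regions_from_exit_intersections`, `simple_end_components`, `deflate_region`), as well as the `game.states` attribute, is a placeholder for the corresponding step described above, not the paper's actual interface.

```python
def approximate_pareto_frontiers(game, epsilon):
    # L: under-approximation, the all-zeros point in every state;
    # U: over-approximation, the full unit hypercube in every state.
    L = {s: zero_point(game) for s in game.states}
    U = {s: unit_hypercube(game) for s in game.states}
    while not epsilon_close(L, U, epsilon):
        L = bellman_update(game, L)   # converges from below on its own
        U = bellman_update(game, U)
        # Extra work so that U also converges, despite end components:
        for region in regions_from_exit_intersections(game, U):
            for sec in simple_end_components(game, region, U):
                U = deflate_region(U, sec, region)  # cap the SEC at its best exit
    return L, U   # valid bounds whenever the loop is stopped
```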
19:25
And this is a convergent anytime algorithm for approximating the values in generalized reachability stochastic games. And by giving it, we have shown that it is indeed possible to approximate the Pareto frontier to arbitrary precision.
19:41
However, we still don't know whether it is possible to compute the exact Pareto frontier, and thus decidability of generalized reachability in stochastic games remains open. Thank you for listening, and goodbye.