
Policy iteration - Algorithm

Formal Metadata

Title
Policy iteration - Algorithm
Series Title
Number of Parts
11
Author
License
CC Attribution 3.0 Germany:
You may use, modify, copy, distribute, and make the work or its content publicly available in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner they have specified.
Identifiers
Publisher
Year of Publication
Language

Policy iteration is similar. We optimize not only the values of the states, but also the policy itself. Policy iteration works as follows: again an infinite loop, the same as we saw for value iteration, but instead of simply outputting a deterministic policy at the end, we try to improve the policy on every pass. The policy is a function that, given a state, returns an action.
So for each state, we remember the current action and then try to recompute the best possible action. If the old action is equal to the new one, the policy is stable and we keep it; this in fact converges to the same result as the previous case, value iteration. But if the new best action differs from the previous one, we conclude that the policy is not yet stable and go back to step two. Steps two and three are repeated until a stable policy is achieved.
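To make the loop concrete, here is a minimal Python sketch of the overall procedure. It assumes an `mdp` object exposing `states` and `actions(s)`; the helpers `policy_evaluation` and `policy_improvement` are illustrative names (sketched further below), not necessarily the identifiers used in the lecture's code.

```python
import random

def policy_iteration(mdp):
    # Step 1: arbitrary initial utilities and a random initial policy.
    U = {s: 0 for s in mdp.states}
    pi = {s: random.choice(list(mdp.actions(s))) for s in mdp.states}
    while True:
        U = policy_evaluation(pi, U, mdp)            # step 2: evaluate the policy
        pi, stable = policy_improvement(pi, U, mdp)  # step 3: improve the policy
        if stable:                                   # no action changed: done
            return pi
```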
Now let's check how it is implemented. There are two big parts: first, evaluation of the policy, and second, improvement of the policy. Evaluation is essentially the computation of a value for every state: the immediate reward plus the discount factor times the sum, over successor states, of the transition probability under the policy's action multiplied by that successor's current value, i.e. U[s] = R(s) + gamma * sum over s' of P(s' | s, pi(s)) * U[s']. We then return an updated utility mapping U from each state in the MDP to its utility, using an approximation. This is one component of policy iteration.
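A possible sketch of this evaluation step, assuming the `mdp` exposes a reward function `R(s)`, a transition model `T(s, a)` returning `(probability, next_state)` pairs, and a discount factor `gamma`; the fixed number of sweeps `k` is an illustrative choice:

```python
def policy_evaluation(pi, U, mdp, k=20):
    # Bellman expectation update for the fixed policy pi:
    #   U[s] = R(s) + gamma * sum_{s'} P(s' | s, pi(s)) * U[s']
    # A fixed number of sweeps k yields an approximation (modified policy iteration).
    for _ in range(k):
        for s in mdp.states:
            U[s] = mdp.R(s) + mdp.gamma * sum(p * U[s1]
                                              for (p, s1) in mdp.T(s, pi[s]))
    return U
```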
You can see it here: policy evaluation. We evaluate how good the policy is by computing all the state values. Now let's start from the beginning. We initialize the state values with zeros, just to begin with something; then pi is the policy, which we initialize with a random action choice for each state and then try to improve. This loop is familiar to you, and this part we already discussed; then here is the implementation of the algorithm. First, we assume that the policy is unchanged, then for each state in the set of available states, we compute the optimal action to be taken in that state. If this optimal action is not equal to the action of the existing policy, we update the policy. Again, the policy is a function where you give a state and receive an action, so it is effectively a set of best actions per state. Here we update the policy, and with that the algorithm is implemented.
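The improvement step just described might look as follows; `expected_utility` and the `unchanged` flag are named for their roles here and are assumptions about the interface, not necessarily the identifiers in the code shown in the lecture:

```python
def expected_utility(a, s, U, mdp):
    # Expected utility of taking action a in state s, given current estimates U.
    return sum(p * U[s1] for (p, s1) in mdp.T(s, a))

def policy_improvement(pi, U, mdp):
    # Assume the policy is unchanged, then check every state.
    unchanged = True
    for s in mdp.states:
        best = max(mdp.actions(s), key=lambda a: expected_utility(a, s, U, mdp))
        if best != pi[s]:        # a strictly better action was found
            pi[s] = best
            unchanged = False
    return pi, unchanged
```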
So far the only remark, not really a question but an observation, is that this algorithm of alternating policy evaluation and improvement looks like a sort of expectation-maximization algorithm, or like a graph search, because we start from zeros and then just iterate and update the values. Exactly, you are completely right.
Okay.