Policy iteration - Algorithm

Formal Metadata

Title: Policy iteration - Algorithm
Number of Parts: 11
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Policy iteration is similar: we optimize not only the values for the states, but also the policy itself. Policy iteration again uses the same kind of loop we saw for value iteration, but instead of simply outputting a deterministic policy at the end, we try to improve the policy on every pass. The policy is a function that, given a state, returns an action.
So for each state, we remember the current action and then recompute the best possible action. If the old action is equal to the newly computed one, the policy is stable and we keep it; this converges to the same result as the greedy policy extraction we saw in the previous case.
But if the newly computed action differs from the previous one, we conclude that the policy is not yet stable and we go back to step two. Steps two and three, policy evaluation and policy improvement, are repeated until a stable policy is achieved.
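Written out as a formula (a paraphrase of the step just described, using the transition function P and the current utility estimates U, and assuming the immediate reward depends only on the state so it does not affect the maximization), the improvement step recomputes for each state s

    π(s) ← argmax_a Σ_{s'} P(s' | s, a) · U(s')

and the policy is stable once this no longer changes π(s) for any state.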
Now let's check how it is implemented. There are two big parts: first the evaluation of the policy, and second the improvement of the policy. Evaluation is essentially the computation of the state value for every state: the immediate reward plus the discount factor times the sum, over successor states in the transition function, of the transition probability under the old policy's action multiplied by the value of that successor state. We then return the updated utility mapping U, from each state of the MDP to its utility, computed using this approximation. That is the evaluation component of policy iteration.
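As a formula (again assuming the reward depends only on the state, with γ the discount factor and π(s) the action prescribed by the current policy), the evaluation update applied for each state s is

    U(s) ← R(s) + γ · Σ_{s'} P(s' | s, π(s)) · U(s')

and repeating this update a fixed number of times yields the approximate utilities used in the improvement step.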
You can see it here: policy evaluation. We evaluate how good the policy is by computing all the state values. Now let's go through it from the beginning. We initialize the state values with zeros, just to begin with something, and pi is the policy: we start with a random choice of action for each state and then try to improve it. The evaluation loop is familiar, we already discussed that part, and then comes the implementation of the improvement step.
First we assume that the policy has not changed, then for each state in the set of available states we compute the optimal action to be taken in that state.
If this optimal action is not equal to the action the existing policy prescribes, we update the policy. Again, the policy is the function to which you give a state and from which you receive an action, so effectively it is a mapping from states to their best actions. This is where we update the policy, and this is how the algorithm is implemented.
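To make the structure concrete, here is a compact sketch of both parts in Python. The mdp object, with states, actions(s), R(s), T(s, a) returning (probability, next_state) pairs, and a discount gamma, is an assumption modelled on common MDP teaching code, not the exact code shown in the lecture:

    import random

    def policy_evaluation(pi, U, mdp, k=20):
        # Approximate evaluation: repeat the Bellman update for the fixed policy pi k times.
        for _ in range(k):
            for s in mdp.states:
                U[s] = mdp.R(s) + mdp.gamma * sum(p * U[s1] for p, s1 in mdp.T(s, pi[s]))
        return U

    def policy_iteration(mdp):
        U = {s: 0 for s in mdp.states}                                      # state values start at zero
        pi = {s: random.choice(list(mdp.actions(s))) for s in mdp.states}   # random initial policy
        while True:
            U = policy_evaluation(pi, U, mdp)                               # step 2: evaluate current policy
            unchanged = True
            for s in mdp.states:                                            # step 3: greedy improvement
                best = max(mdp.actions(s),
                           key=lambda a: sum(p * U[s1] for p, s1 in mdp.T(s, a)))
                if best != pi[s]:
                    pi[s] = best
                    unchanged = False
            if unchanged:                                                   # policy is stable: done
                return pi

Note that policy_evaluation here is only approximate, a fixed number of Bellman updates rather than an exact solve, which is what is meant above by computing the utilities using an approximation.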
[Audience:] So far, not a question but an observation: this algorithm of policy evaluation and improvement looks like a sort of expectation-maximization algorithm combined with graph search, because we start with zeros and then just iterate and update the values. [Lecturer:] Exactly, you are completely right.
Okay.