Policy iteration - Algorithm
Formal Metadata
Title: Policy iteration - Algorithm
Number of Parts: 11
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/63105 (DOI)
Transcript: English (auto-generated)
00:00
Policy iteration is similar: we optimize not only the values of the states but also the policy itself. Policy iteration works as follows: again an infinite loop, the same as we saw for value iteration,
00:26
but instead of simply outputting a deterministic policy at the end, we try to improve the policy. The policy is a function that, given a state, returns an action.
00:41
So for each state we remember its action and then try to recompute the best possible action. If the old action is equal to the newly computed one, the policy is stable and we keep it; that is fine, and it actually converges to the same result as in the previous case.
01:02
But if the new action differs from the previous one, we conclude that the policy is not stable and go back to step two. Steps two and three are repeated until a stable policy is achieved.
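In standard MDP notation (this formulation is not written out in the transcript, but it matches the two steps described above; U is the utility of a state, R the reward, P the transition probabilities, gamma the discount factor), the repeated steps are

    Policy evaluation:   U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, U^{\pi}(s')
    Policy improvement:  \pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \, U^{\pi}(s')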
01:20
Now let's check how it is implemented. There are two big parts: first the evaluation of the policy, and second the improvement of the policy. Evaluation is essentially the computation of state values for every state: the immediate reward plus the discount factor times the sum, over the transitions prescribed by the old policy,
01:44
of the probabilities multiplied by the values of the successor states given by the transition function. We then return an updated utility mapping U from each state in the MDP to its utility, using this approximation. It is one component of policy iteration.
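As a rough illustration of that evaluation step (the exact code shown in the lecture is not reproduced here), a minimal sketch assuming the MDP is given as a collection of states, a reward function R(s), a transition model T(s, a) returning (probability, successor) pairs, and a discount factor gamma; these names are illustrative:

    def policy_evaluation(pi, U, states, R, T, gamma, k=20):
        """Return an updated utility mapping U (state -> value) for the fixed
        policy pi, using k sweeps of the simplified Bellman update."""
        for _ in range(k):
            for s in states:
                # immediate reward + discounted expected utility of the
                # successors reached when following pi in state s
                U[s] = R(s) + gamma * sum(p * U[s1] for p, s1 in T(s, pi[s]))
        return U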
02:01
You can see it here: policy evaluation. We evaluate how good the policy is by computing all the values. Now let's start from the beginning. We initialize the state values with zeros, just to begin with something; then pi is the policy, starting from a random choice, which we then try to improve. This loop is familiar to you, this part we have already discussed, and then here is the implementation
02:25
of the improvement step of the algorithm. First we assume that the policy is unchanged, as here; then for each state in the set of available states we compute the optimal action to be taken in that state.
02:42
And if this optimal action is not equal to the one given by the existing policy, we update the policy. Again, the policy is the function where you give a state and receive an action, so it is essentially a table of best actions. Here we update the policy, and with that the algorithm is implemented.
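Putting both parts together, a minimal sketch of the whole loop, reusing policy_evaluation from the earlier sketch and additionally assuming a function actions(s) that lists the actions available in state s (again, illustrative names, not the lecture's exact code):

    import random

    def expected_utility(a, s, U, T):
        # expected utility of taking action a in state s, given current values U
        return sum(p * U[s1] for p, s1 in T(s, a))

    def policy_iteration(states, actions, R, T, gamma):
        U = {s: 0.0 for s in states}                          # start from zero values
        pi = {s: random.choice(actions(s)) for s in states}   # random initial policy
        while True:
            U = policy_evaluation(pi, U, states, R, T, gamma)   # step 2: evaluate
            unchanged = True
            for s in states:                                    # step 3: improve
                best = max(actions(s), key=lambda a: expected_utility(a, s, U, T))
                if best != pi[s]:
                    pi[s] = best
                    unchanged = False
            if unchanged:       # policy stable: no state changed its action
                return pi, U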
03:01
So far, not a question but an observation: this algorithm of alternating policy evaluation and improvement looks like a sort of expectation-maximization algorithm combined with graph search, because we start with zeros and then just iterate and update the values. Exactly, you are completely right.
03:41
Okay.