Optimal Policy and Max Total Reward
Formal Metadata
Title | Optimal Policy and Max Total Reward
Number of Parts | 11
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/63106 (DOI)
Transcript: English (auto-generated)
00:00
So, having the Bellman equations as a way to compute action values and state values, let's actually compute them.
00:21
How should we compute them? In a loop, of course. The two algorithms that we're going to talk about now, value iteration and policy iteration, are nothing else than a way to solve the Bellman optimality equations for the state-value and action-value functions numerically. They converge when the optimal policy is reached.
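For reference, the Bellman optimality equations being solved here can be written, in standard notation with transition probabilities $p(s', r \mid s, a)$ and discount factor $\gamma$, as:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr]$$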
00:43
We start with some non-optimal policy, iteratively try to improve the values, and eventually converge to the optimal policy for this set of states and actions. Here is the algorithm for value iteration for estimating an optimal policy. There are a number of parameters.
01:01
We will have a small threshold larger than zero that determines the desired accuracy of the estimation of the optimal policy. Then we initialize the values for all states with some defaults, except for the terminal state, whose value is zero. Here is the algorithm itself.
01:21
I will comment on it and then show its implementation. So, in a loop, until the convergence limit is reached, do the following for all states: store the previous value of the current state, and here the Bellman equation for the state-value
01:40
function comes into play. For this state, we compute the new value function as follows. For all possible next states, taking the transition function into account, we compute the immediate reward plus the discount multiplied by the value of the
02:03
next state. Having computed that for all possible actions, we select the maximum value that we can achieve, and this way, for this particular state, we update the value with the new one. Then we compute the difference between the old values and the newly computed values to decide
02:22
whether we should stop. This algorithm outputs a deterministic policy such that the policy, or what we should do in each state, is the action that maximizes this expression. Let's consider some real-world problem and check how this algorithm is implemented there.
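In equation form, the update, the stopping rule, and the policy read-out just described are those of the standard value-iteration algorithm:

$$V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, V(s')\bigr]$$

$$\Delta = \max_s \bigl|V_{\text{old}}(s) - V_{\text{new}}(s)\bigr|, \qquad \text{stop when } \Delta < \theta$$

$$\pi(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, V(s')\bigr]$$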
02:42
An example MDP would be a student deciding what to do in their life. There are terminal states, there are transitions between states, there is a gamma discount coefficient, there are rewards, and there is a function that returns a numeric reward for a state. This is the transition function.
03:00
It returns, for each state and action taken, the chance of each transition. There is also the set of actions that can be performed in a state: by default it's a fixed list of actions, except for the terminal state, for which it will be none. We can get the next states from the transitions, essentially just using the chance of a transition
03:20
and the transition function. Our example MDP is a graduate student's dilemma. Here is how it looks when you appear in one of the states. The states are leisure and the classes. The harder the class, the more reward we get for it, and we get penalized for being lazy and scrolling Facebook.
03:40
And the set of actions is different for each state, as you see. Also, there is a terminal state: you end with sleeping or studying. What is the transition matrix? As I told you before, for each state, and we have four states, there is a chance that, having made the decision to move to a state, we actually reach it. So going to leisure from Facebook will be 90%, going to class from Facebook will be 10%, and for
04:05
quitting it is the other way around. Then for class one, going via the study action from class one to class two has 60% probability, going to leisure from class one has 40% probability, and the same for the rest. Not all actions are available in each state.
04:22
The actions that are unavailable for a state simply have no transition probability, so it is impossible to perform that action in that state. Here is the reward function. The reward function is simplified: it is a dict. Since it's an episodic task, the reward for being in the terminal state is zero.
04:41
We record these states as terminals. We set our initial state to be class one. We create the instance of the MDP that I showed you earlier. We set the discount to 0.95; it will give the agent a pretty nice long-term perspective to consider its future actions.
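The notebook code itself is not reproduced in the transcript, so the following is only a minimal Python sketch of how such a student MDP could be laid out. The state and action names and the reward numbers are assumptions; only the probabilities explicitly mentioned above (90/10 for Facebook and quitting, 60/40 for studying from class one) and the discount of 0.95 come from the talk, and the remaining transitions are placeholders.

```python
# Minimal sketch of the student MDP described above (not the original notebook code).
# Only the 0.9/0.1 and 0.6/0.4 probabilities and gamma come from the talk;
# the rest, including all reward numbers, are illustrative placeholders.

terminal_states = {"sleep"}
initial_state = "class1"
gamma = 0.95  # discount: gives the agent a long-term perspective

# transitions[state][action] -> {next_state: probability}
transitions = {
    "leisure": {
        "facebook": {"leisure": 0.9, "class1": 0.1},
        "quit":     {"leisure": 0.1, "class1": 0.9},
    },
    "class1": {
        "study": {"class2": 0.6, "leisure": 0.4},
        "relax": {"leisure": 1.0},                # placeholder
    },
    "class2": {
        "study": {"sleep": 1.0},                  # placeholder: finish and sleep
    },
    "sleep": {},                                  # terminal: no actions
}

# Simplified reward function: a dict over states; terminal reward is zero,
# harder classes pay more, leisure is penalized (illustrative numbers).
rewards = {"leisure": -1.0, "class1": 1.0, "class2": 2.0, "sleep": 0.0}

def actions(state):
    """Available actions in a state; None for terminal states."""
    return None if state in terminal_states else list(transitions[state].keys())
```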
05:04
Now we're just checking: what do we have? We have the following set of states: one, two, three, four, five states. We have different actions for different states. As I told you, due to the transition function, some of them are simply unavailable because
05:22
they have no chance to transition. Then here are the actions which are available in a terminal state: none.
05:41
And here is the reward schema. Now, knowing our environment and the goal, we want to maximize reward. We want to get to the final class and to graduate. We want to get from leisure back to graduation.
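Inspecting the environment, as the speaker does here, might then look like this (building on the hypothetical names from the sketch above):

```python
# Inspect the environment defined in the sketch above.
print("states:", list(transitions.keys()))

for s in transitions:
    # Available actions per state; the terminal state yields None.
    print(s, "->", actions(s))

print("reward schema:", rewards)
```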
06:02
Here is the implementation of value iteration. Let's get back to the schema to refresh it: we loop in an infinite loop, calculating the value function, and then choosing a policy based on the best action from the set of possible ones. Here we record all the states. Then we read in the reward schema, the transition probabilities, and the discount.
06:24
Then comes this infinite loop, this one, the outer loop. We perform the outer loop until delta becomes smaller than some predefined epsilon. It's the condition to interrupt the value iteration and return the optimal policy.
06:43
This one, the core, is computing the values for each state, and it is very direct: for each state among all the states, compute the new value by adding the immediate reward, here you have the immediate reward, and then the discounted sum of future rewards.
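The loop being walked through here could be sketched as follows, using the dict-based MDP assumed above. The variable names (epsilon, history) and the convention of adding the current state's reward outside the maximum follow the description in the talk, not the original notebook.

```python
def value_iteration(transitions, rewards, gamma, epsilon=1e-6):
    """Minimal value-iteration sketch over the dict-based MDP sketched above."""
    V = {s: 0.0 for s in transitions}   # initialize all state values with defaults
    history = [dict(V)]                 # record values after every iteration (for plotting)

    while True:                         # outer loop, repeated until convergence
        delta = 0.0
        for s in transitions:
            if not transitions[s]:      # terminal state: its value stays zero
                continue
            v_old = V[s]
            # Immediate reward plus the discounted, probability-weighted value of
            # the next states, maximized over the available actions.
            V[s] = rewards[s] + gamma * max(
                sum(p * V[s2] for s2, p in transitions[s][a].items())
                for a in transitions[s]
            )
            delta = max(delta, abs(v_old - V[s]))
        history.append(dict(V))
        if delta < epsilon:             # interrupt once the largest change is small enough
            return V, history
```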
07:03
And this is for plotting, to record how the values changed, and then the best policy, this row, the best policy: how should the student act optimally? We simply select, for each state, the index of the action with the largest value. At first, the policy
07:21
is not optimal, but after some computations it converges to the optimal one. And if we plot the values for each state over time, we see that on every iteration the values for every state change towards stable values, based on which we can build
07:41
our optimal path. Okay.
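The greedy policy read-out described at the end, picking in every state the action whose one-step lookahead value is largest, could be sketched like this (again using the assumed data structures and the value_iteration sketch above):

```python
def extract_policy(V, transitions, rewards, gamma):
    """Deterministic greedy policy: in each state, take the action that
    maximizes the one-step lookahead value."""
    policy = {}
    for s in transitions:
        if not transitions[s]:          # terminal state: nothing left to do
            policy[s] = None
            continue
        policy[s] = max(
            transitions[s],
            key=lambda a: rewards[s] + gamma * sum(
                p * V[s2] for s2, p in transitions[s][a].items()
            ),
        )
    return policy

V, history = value_iteration(transitions, rewards, gamma)
print(extract_policy(V, transitions, rewards, gamma))
```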