
Value Functions


Formal Metadata

Title
Value Functions
Number of Parts
11
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

What is a value function, and what is a value? Value is a specific word for expected reward, and a value function is a function that evaluates it for the environment we have, the policy we have, and the set of states we have. There are two types of value functions: the state value function and the action value function. All of the neural networks and machinery under the hood compute or approximate these functions.
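As a rough illustration of what these functions are, here is a minimal tabular sketch in Python; the state and action names and the numbers are invented and not from the talk.

    # Tabular sketch: V maps each state to its expected return,
    # Q maps each (state, action) pair to its expected return.
    # All names and values here are illustrative only.
    V = {"s0": 0.0, "s1": 1.2, "s2": -0.4}            # state value function V(s)
    Q = {("s0", "left"): 0.3, ("s0", "right"): 0.9}   # action value function Q(s, a)
    # A neural network replaces these tables with a parametric approximation of
    # V(s) or Q(s, a) when the state space is too large to enumerate.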
Remember this formula, the sum of discounted future rewards; we talked about it earlier. Here it is again. I told you that there is a trick for evaluating the sum of expected future rewards without knowing the end of the episode: the introduction of gamma, a decay coefficient. Here is the expression for that, and here is the recursive expression for calculating the sum of future rewards. We will use both of them.
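The slide formulas are not reproduced in the transcript; in the standard notation they would typically look like this (a sketch, assuming the usual indexing where R_{t+1} is the reward received after step t):

\[
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;=\; R_{t+1} + \gamma\, G_{t+1}
\]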
First you will see the explicit sum, then the recursive form, and when I substitute the sum of rewards with its recursive formulation, I will call the resulting function the Bellman equation. Don't be scared, everything is simple. The state value tells us how good it will be to move to the next state from the current one.
Imagine your agent is here and it is thinking: would it be better to move here or here? We need to somehow evaluate these two states for the agent to make a decision. The expected reward here and here will be the same; the probabilities and the policy will adapt accordingly.
This is the result, but to come to this result there are many intermediate steps, and we will get to them. Importantly, the unit of measure for all values is reward, so these are expressions over expected reward. "Expected" means how much I will get, not in the current position, but from the next position onward.
Both of these values help the agent make a decision. The state value function takes into account only the next states: notice that we start the computation from t plus one; we are not considering the current state, we do not care about it, we only want the sum of future rewards. The action value takes into account not only future states but also the actions that have to be taken to get into those new states. Both of these values are calculated from experience, and experience is the interaction of the agent and the environment. There will be many, many episodes, after which we will compute the optimal state value function and the optimal action value function. Now, what is the Bellman equation? There are two Bellman equations for the two cases: one for the state value function and one for the action value function.
You may notice that if we simply substitute the discounted future rewards with their recursive representation, we get the Bellman equation. Now let's go a bit deeper. The Bellman equation is an instrument that allows us to compute the state value for each cell of the grid of our world. It tells us how to find the value of a state while following some policy P, which is the probability of choosing a particular action while being in a particular state. Policies are non-optimal from the start, and we learn the policy by making a lot of mistakes through interaction with the environment. We are in the state S and we have a set of possible actions available to the agent in this state: A1, A2, A3. This is where the sum comes from; it is the sum over every possible action in this state. So there will be three Ps, and P is again the probability that, being in this state, the agent will choose this action, or this one, or this one. For example, here it could be 0.3, 0.1, and 0.6.
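The equation itself is not captured in the transcript; in conventional notation (writing the policy, which the talk calls P, as \pi, and the transition function as P(s'|s,a)) the Bellman expectation equation for the state value function is usually written as:

\[
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{\pi}(s') \bigr]
\]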
This part is recursive. It consists of the immediate reward, the reward that we get immediately by taking one of the three actions, and we evaluate it for each action through the transition function. The transition function introduces some randomness, and this stochasticity guarantees exploration, so that the agent does not get stuck on the first good pathway it has found; we give it a chance to explore pathways better than the known ones. This stochasticity can be seen on the diagram as the split: there is a 90% chance that by taking this action in this state we will get into this state, but there is also a 10% chance that we will end up in that state. So we have immediate reward, immediate reward, immediate reward, and the decay coefficient, gamma, multiplied by the sum of expected rewards. Now let's project this knowledge onto the formula: we see the immediate reward, and the decay coefficient gamma multiplied by the expected sum of future rewards obtained by moving according to the same policy.
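As a concrete illustration of one such backup, here is a minimal Python sketch using the probabilities mentioned in the talk (a 0.3/0.1/0.6 policy over three actions and a 0.9/0.1 transition split); the state names, rewards, successor values, and the discount value are invented for illustration.

    # One Bellman backup for a single state s:
    # V(s) = sum_a pi(a|s) * sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    gamma = 0.9  # assumed decay (discount) coefficient

    policy = {"a1": 0.3, "a2": 0.1, "a3": 0.6}          # pi(a|s) for the three actions

    # transition function: for each action, a list of (P(s'|s,a), s', immediate reward)
    transitions = {
        "a1": [(0.9, "s1", 1.0), (0.1, "s2", 0.0)],
        "a2": [(0.9, "s2", 0.0), (0.1, "s3", 2.0)],
        "a3": [(0.9, "s3", 2.0), (0.1, "s1", 1.0)],
    }

    V = {"s1": 0.5, "s2": 0.2, "s3": 1.5}               # current estimates for successor states

    v_s = sum(
        policy[a] * sum(p * (r + gamma * V[s_next]) for p, s_next, r in transitions[a])
        for a in policy
    )
    print(round(v_s, 3))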
This term is how much reward we will actually get, and we cannot compute it until we reach the end of the episode. During the episode the agent does not yet have values for the states; it computes the values for all of the states when it reaches the terminal state. Until then, there is a long graph of dependent calculations, which are performed once the final value is received, and the final value is received at the end of the episode. After that, the values quickly propagate back through this graph, and this way we can compute the values for all the states.
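A minimal sketch of that backward pass, assuming a finished episode is stored as a list of per-step rewards (the reward values and the discount are illustrative):

    # Compute the discounted return G_t for every step of a finished episode,
    # working backwards from the terminal state: G_t = r_t + gamma * G_{t+1}.
    gamma = 0.9                            # assumed discount coefficient
    rewards = [0.0, 0.0, 1.0, -1.0, 5.0]   # illustrative rewards of one episode

    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()                      # returns[t] is the return from step t onward
    print(returns)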
To review what action values are, we first have to understand the concept of Q-states. Earlier we were in a world with only states, actions, and new states, and we made decisions based only on states: the current state and the future state. But there is another way. Imagine we are in this state and we are selecting the pair of an action and the new state. This tuple of state and action is called a Q-state. The Q-state value is the expected reward of being in state S and performing action A. Previously, by contrast, we were saying: if I am in this state and going to be in that state, this is what the reward will be, and we were omitting the phase of stochasticity of actions.
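In the usual notation (not shown in the transcript), the two quantities are defined as conditional expectations of the return G_t:

\[
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\bigl[\, G_t \mid S_t = s \,\bigr],
\qquad
Q^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\bigl[\, G_t \mid S_t = s,\; A_t = a \,\bigr]
\]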
In this setup we are saying what the reward could be if, being in this state, we took action A. So we are evaluating the reward not based on the final state, but based on the pair of the current state and the action. Here we have depicted it. In the setup for the state value function, we computed the value for the expected future state: we took into account the probabilities of taking the actions, and there was stochasticity introduced after we chose an action. In effect we expected new states, but we had no control over which final state we would reach.
In this case, we are evaluating the reward based on the pair that includes the action we are going to perform. So the expected reward is computed here, for this state: we compute the Q value, and for the successor state we compute the state value function. In the same way as for the state value function, this equation tells us how to find, recursively, the value of a state-action pair while following a policy P. Pay attention: before, the parameter was only the state; here the parameters are the state and the action, and we again sum over all possible next states, with gamma multiplying the sum of future rewards. If we write it recursively, we get the final expression for the action value function. So, having a state-action pair, we can evaluate recursively how good it will be to take that particular action in that state. Since the state value function is equivalent to the sum of the action value functions over all outgoing actions, each multiplied by the policy probability, we could also rewrite this formula that way.
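In conventional notation, the recursive expression for the action value function and the rewriting mentioned here would look roughly as follows (a sketch; the slides' exact notation is not in the transcript):

\[
Q^{\pi}(s, a) \;=\; \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{\pi}(s') \bigr],
\qquad
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)
\]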
We have reviewed the expressions for the utilities: the state value function and the action value function. But what we want to achieve is the optimal policy, and the optimal policy is the one that leads to the maximum total collected reward. Such a policy can be expressed through the optimal state value function or the optimal action value function: I compute the value for each of the possible moves and choose the one with the largest value. These are the optimal value functions, the optimal state value and the optimal action value. Let's combine everything we have learned. If I wanted to compute the optimal values, I would use the Bellman equations. There is a Bellman equation for the state value function and one for the action value function. The optimal value functions are simply the largest of the available values, so we just rewrote the sums with the max sign. Yes, we still have to compute all of them, but then we take the largest.
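Written out in the standard form (again a sketch, since the slides are not reproduced here), replacing the sum over the policy with a max gives the Bellman optimality equations:

\[
V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{*}(s') \bigr],
\qquad
Q^{*}(s, a) \;=\; \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, \max_{a'} Q^{*}(s', a') \bigr]
\]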
Thank you.