Formal Metadata

Title
Policy
Number of Parts
11
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
What is a policy? A policy is a way of acting. It's a function: you give it a state and it returns an action.

Agents start with some non-optimal policy, performing actions which don't lead to the desired outcome. With time, iterations, and optimization, we arrive at the optimal policy. These are different concepts: there is a policy, and there is the optimal policy. A policy is just the agent's way of acting in the current environment; it says nothing about optimality. The optimal policy is a set of actions like the one you see here, and this set of actions maximizes the reward in the optimal number of steps. Let's take a closer look at this image. Here is a simple grid environment.
Here is a state you cannot enter. This is the desired state, which gives us a reward of plus 100. This is a highly undesirable state, which gives us a reward of minus 100. The agent is spawned in a random cell of the grid. For each cell, we want to know what sequence of actions the agent has to perform to collect the maximum sum of rewards. But what is the reward, you ask? I don't get any reward for being in any of these states, right? I only get a reward for being here and a penalty for being here.

So I will get a reward only when the episode ends, and the episode ends when the agent reaches either this state or this one. The agent could learn to just roll randomly to the left and to the right, so I need to do something to force the agent to minimize the number of steps. Let's penalize the agent with minus one for every step: a tiny penalty compared to the huge benefit and the huge penalty, but it results in the agent trying to minimize the number of steps. Now it will move fast. It won't circle between two states forever; it will try to reach the desired objective as fast as possible.
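Here is a minimal sketch of that reward structure, assuming a 3x4 grid; the coordinates of the blocked and terminal cells are illustrative guesses, and only the +100, -100, and -1 values come from the talk.

```python
# Minimal sketch of the grid world's reward structure.
# Cell coordinates are assumptions; only +100, -100, and the -1
# step penalty are taken from the description above.
GRID_ROWS, GRID_COLS = 3, 4
BLOCKED = {(1, 1)}            # the state you cannot enter
TERMINALS = {(0, 3): +100,    # the desired "diamond" state
             (1, 3): -100}    # the highly undesirable "death" state
STEP_PENALTY = -1             # tiny penalty for every step

def reward(state):
    """Reward the agent receives upon entering `state`."""
    return TERMINALS.get(state, STEP_PENALTY)

def is_terminal(state):
    """An episode ends when the agent reaches +100 or -100."""
    return state in TERMINALS
```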
And in front of you, you see the optimal, learned policy. This function returns, for each state, the action we have to take to achieve the maximum sum of rewards. So to summarize, the policy is simply a function: it receives a state and returns an action.
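In an environment this small, the policy can literally be a lookup table wrapped in a function. The sketch below assumes the same 3x4 grid; the specific arrows are illustrative, not the exact policy shown on the slide.

```python
# A deterministic policy as a plain lookup table: state in, action out.
# The specific arrows are illustrative, not the exact policy from the slide.
LEARNED_POLICY = {
    (0, 0): "right", (0, 1): "right", (0, 2): "right",
    (1, 0): "up",                     (1, 2): "up",
    (2, 0): "up",    (2, 1): "right", (2, 2): "up",   (2, 3): "left",
}

def policy(state):
    """The policy is simply a function: give it a state, get an action."""
    return LEARNED_POLICY[state]

print(policy((2, 1)))   # -> "right"
```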
We start with a non-optimal policy; this is similar to the initialization of weights in a neural network. The policy is learned iteratively during the interaction of the agent with the environment; we also call this training. The training is split into episodes, and each episode ends with reaching a terminal state, either the diamond or this death cell.
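As a sketch of this episode structure, here is a hedged training loop; `env`, `select_action`, and `update` are hypothetical placeholders standing in for whatever environment API and learning rule you use, and are not defined in the talk.

```python
# Sketch of training split into episodes. `env`, `select_action`, and
# `update` are hypothetical placeholders, not something defined in the talk.
def train(env, select_action, update, num_episodes=1000):
    for _ in range(num_episodes):
        state = env.reset()                  # agent spawns in a random cell
        done = False
        while not done:                      # episode runs until a terminal state
            action = select_action(state)
            next_state, reward, done = env.step(action)
            update(state, action, reward, next_state)
            state = next_state
```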
Our goal in any deep RL task is to find the policy that returns the largest possible sum of rewards. It is denoted as pi star: the input is a state s, the output is an action a, and for all states we have to know the optimal action.
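Written out, one common way to state this goal is the following; the expected-return notation (and the omitted discount factor) is the usual convention rather than something shown in the talk.

```latex
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}\!\left[\sum_{t} r_t \;\middle|\; \pi\right],
\qquad a = \pi^{*}(s) \quad \text{for all states } s.
```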
Now, if you think of the policy as a neural network, it could work in unlimited environments. This environment is limited and simplified; we don't need a neural network here, or any kind of model. We can represent our policy as a two-dimensional matrix, three by four, compute a number for each cell, and be happy. But what would we do if the environment itself changes, or there are unexpected states which we don't know? We still have to learn the policy, and then we have to put a model in place of the tabular policy to approximate those new, unknown states.
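As a sketch of the "model in place of a table" idea, here is a minimal policy network in plain NumPy; the layer sizes, the (row, col) state encoding, and the softmax output are assumptions for illustration, not details from the talk.

```python
import numpy as np

# Minimal policy-network sketch: a (row, col) state goes in, a probability
# distribution over the four grid actions comes out. Layer sizes, the state
# encoding, and the softmax output are illustrative assumptions.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)    # 4 actions: up/down/left/right

def policy_net(state):
    h = np.tanh(np.asarray(state, dtype=float) @ W1 + b1)
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())            # numerically stable softmax
    return exp / exp.sum()

print(policy_net((2, 1)))   # untrained: roughly uniform over the 4 actions
```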
There are deterministic and stochastic policies. A deterministic policy returns one action for one state. Let's think about this example: the agent spawns in any of the cells of this world, and its goal is to reach the home cell in as few steps as possible. Here you can already see the optimal policy: if we appear here, go right, go up, go right; if we appear here, go right, go right, go right; if we appear here, go up; if we appear here, go right, go right, go up, go right.
A deterministic policy: one state, one action. With a stochastic policy, for each state we instead receive a distribution of probabilities, and the higher the probability of an action, the more we consider it the optimal action. If this stochastic policy converges to an optimal policy for the same task, we will see the following picture. For these states there are no options, no second optimal pathway; we only go right, so the probability of going right in these cells will be 100%. In the second row, on the contrary, the probability of moving up or of moving right will be 50/50. Why? Because it doesn't matter: whether we go up, or go right in these three cells, we reach home in the same number of steps, and the reward will be equal. That's why the distribution over optimal actions looks like this.
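A stochastic policy can be sketched as a table of per-state action probabilities, sampled at decision time; the concrete cells below are illustrative, while the 50/50 split in the second row mirrors the example above.

```python
import random

# Stochastic policy sketch: each state maps to a probability distribution
# over actions. The specific cells are illustrative; the 50/50 split in the
# second row mirrors the example above.
STOCHASTIC_POLICY = {
    (0, 0): {"right": 1.0},                # no alternative optimal path
    (0, 1): {"right": 1.0},
    (1, 0): {"up": 0.5, "right": 0.5},     # both choices reach home equally fast
    (1, 1): {"up": 0.5, "right": 0.5},
    (1, 2): {"up": 0.5, "right": 0.5},
}

def sample_action(state):
    dist = STOCHASTIC_POLICY[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]

print(sample_action((1, 1)))   # "up" or "right", each with probability 0.5
```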
Formally, a policy is a mapping from states to probabilities of selecting each possible action from the action set of the MDP. If the agent is following policy π at time t, then π(a|s) is the probability that at this time step t the chosen action will be a, given that the current state is s. π is an ordinary function; the bar between a and s reminds us that it defines a probability distribution over all available actions for each state. So for each state we have a distribution of probabilities with which each action will be chosen by the policy.
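One standard way to write this definition down (the time-indexed random variables A_t and S_t are conventional notation, not something spelled out in the talk):

```latex
\pi(a \mid s) \;=\; \Pr\!\left( A_t = a \;\middle|\; S_t = s \right)
```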
So RL is all about how the agent's policy is changed and learned as a result of the interaction between the agent and the environment, also called experience. Thank you.