
Automated Reinforcement Learning


Formal Metadata

Title
Automated Reinforcement Learning
Number of Parts
18
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Transcript: English (auto-generated)
I'm really happy to give this tutorial on AutoRL today. In the first half of this tutorial I'm going to spend the time really emphasizing what is different about RL compared to supervised learning and what the impact is on the meta-problem that we're trying to solve.
And in the second half Teresa is then going to show you some solution approaches on how to tackle this problem. Now, we have already been introduced but I still wanted to highlight the AutoRL.org logo because we just recently set this up and wanted to have a resource where we collect more information on AutoRL and make it easily accessible and digestible.
So we're hoping to grow the space even more, so if you're interested in the topic after our tutorial, please check AutoRL.org. But now I want to get you excited about AutoRL, right? And to be able to answer the question, why does AutoRL matter, we also need to answer the question, why does RL matter?
And Mariusz gave you some ideas about embodied AI and so on, but not everyone in the room might have a good understanding of what RL is so far. So maybe a quick show of hands: who here has experience with RL? Quite a few people, but still I need to make sure that everyone is on the same page.
So here's a short primer on RL. And RL can be very much summarized with this one statement. It's agents learning by interacting with their world. And this graphic that you see here, you will find in any RL textbook or RL lecture that just shows you how this interaction is happening.
So we have the world or the environment, which the agent can observe. And this observation is basically of the state of the environment. And based on the observation of the state of the environment, the agent decides how to act.
And this action then induces a change in the environment, which of course gives a new observation. And there's also an associated reward with this observation and this induced change. So this reward tells you, is this a good action to do when I observe a certain state or is it a bad choice?
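The observe-act-reward cycle just described can be sketched in a few lines. The `env` and `agent` objects and their methods here are hypothetical Gym-style stand-ins for illustration, not a specific library API:

```python
# Sketch of the RL interaction loop: observe the state, act, receive a
# reward, update. `env` and `agent` are hypothetical Gym-style objects.

def run_episode(env, agent):
    """Run one episode of agent-environment interaction."""
    observation = env.reset()                # initial observation of the state
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(observation)              # decide how to act
        next_obs, reward, done = env.step(action)    # action changes the environment
        agent.update(observation, action, reward, next_obs)  # learning update
        observation = next_obs
        total_reward += reward
    return total_reward
```

The agent interacts, observes whether the interaction was good or bad, and does its learning update inside the loop.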
And that's really already the reinforcement learning loop. You just interact with the environment and observe, was it a good interaction or was it a bad interaction? And then you do your learning updates. And to give you a small idea how this looks like, here is a short demo of a tabular reinforcement
learning setting where we have a small agent that needs to go from the top left to the top right. And it gets a reward of zero for every step, except when it falls off the cliff. Then it gets a negative reward, and it gets a positive reward when it reaches the goal.
And what you just saw is it interacts with the world. It does some random walk in the beginning. And now already it has observed some successful policies and it has an estimate of where the good rewards are. And in essence, in this small example, it gives us an estimate of how far away
from the goal it is and which actions it should take to then get to the goal. And if you let it now train forever, it will reach the goal. Yeah. And this is basically the auto-RL problem. Not auto-RL problem, the RL problem.
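A minimal tabular Q-learning sketch in the spirit of this demo: the grid is flattened to a short corridor and the rewards are slightly adjusted (a small per-step cost instead of zero, so the agent is nudged toward the goal in few lines of code); all names and constants here are illustrative:

```python
# Tabular Q-learning in the spirit of the cliff-walk demo, flattened to a
# corridor of five states with the goal on the right. Rewards are adjusted
# to a small step cost plus +1 at the goal; all constants are illustrative.
import random

N_STATES = 5                   # states 0..4, goal at state 4
ACTIONS = [-1, +1]             # move left / move right
ALPHA, GAMMA, EPS = 0.5, 0.95, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]  # tabular value estimates

random.seed(0)
for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly follow the current estimate, sometimes explore
        if random.random() < EPS:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda i: Q[s][i])
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else -0.01
        # Q-learning update: move the old estimate toward the bootstrapped target
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next
```

After training, the greedy policy prefers moving right in every non-goal state, mirroring the demo where the agent's estimates eventually point toward the goal.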
But of course, this tabular example is not the only way, and you can do RL in much more sophisticated settings than this very simple toy example with a grid world.
And there have been many well-publicized success stories, right? And the big one here I highlighted in the center really kicked off the so-called deep reinforcement learning revolution. Where all this function approximation that is needed in reinforcement learning is now being done by deep neural networks. And what really caught the attention of people is that you can now
train agents that are capable of outperforming humans in these video game playing tasks. There are of course other well-publicized stories that you have probably heard of like AlphaGo. But I wanted to highlight two other ones that are also more in this competition setting, more like the video games rather than the board games.
In the first one, there is an example from Sony AI where they trained a deep reinforcement learning agent to play racing games and actually outperform the best human players. So a very high-speed control task, which is also very difficult to learn.
But Sony pulled it off and they had a very good result. And most recently, there was a deep reinforcement learning agent trained to control a racing drone. And the difference between the first two examples and the last one is it's actually applied in the real world. So deep reinforcement learning does not have to only happen in simulation.
And to come back to the first example, what was great about the Atari example is that we now had a signal that we can learn these control tasks from high-dimensional inputs such as images.
So the control of the successful agents really is purely based on pixels. Now, in the Sony AI example where they competed against the best human players, it is not purely pixel-based, because as you can see this would otherwise be a very high-fidelity input. What they did here was use lots of domain expertise to structure the state space so that it is actually possible to learn in this example.
And what you can see here is the agent, which we are following, trying to overtake the best human player, who is in second place. In first place, there is already an AI. What you could see in this overtake is that the agent at first looked like it might lose control, but it has learned to basically over-brake and make the turn, which is impossible for the human players because they cannot react to the same inputs. But again, this was all in simulation.
Now going to the real world: here again you can see the human player, who is in red, and the AI player, which is in blue. Again, the agent has been trained in simulation, and lots of domain expertise went into matching, this time, pixel observations to IMU observations,
so what the inertial measurement unit of the drone is seeing. But if you train it properly and really take care of how to set hyperparameters and everything, you can get to a very well performing RL agent that can solve this hard control problem in the real world.
But I have hinted at this already. All previous examples worked well due to lots and lots of domain expertise. This domain expertise is not only there to set up the problem for the RL agent to solve
but also how to set up the RL agent itself, how to adapt the hyperparameters, how to modify the data in a way that the problem is actually tractable to learn with deep reinforcement learning methods. And there was a very influential paper from 2018 at the AAAI
which really showed the sensitivity of reinforcement learning to the choice of hyperparameters. And the paper is called Deep Reinforcement Learning That Matters and it showed that RL is sensitive to the choice in hyperparameters, the choice in network architecture, the choice in reward scale.
The choice of random seeds: already just running it on different seeds has a huge impact on performance, much more than you might see in supervised learning. The choice of environment type and how the transition dynamics of the environment work, as well as the code base.
Just using the PPO agent from one code base does not perform the same as the PPO agent from another code base even if you have exactly the same hyperparameters and environments and so on. So reinforcement learning, getting it to work, uses a huge amount of domain expertise and often you find people that are not familiar with RL saying RL doesn't work yet.
Even though we have seen these high impact publications where it clearly works. Now to emphasize that the choice of environment has a huge impact on RL, let's look at the example from this 2018 paper where they looked at the learning curves on the hopper environment and we want to maximize the average return here.
So on the Hopper environment, the TRPO and PPO algorithms perform best, whereas the DDPG algorithm is very noisy and performs worst. Switch to the HalfCheetah environment, and all of a sudden DDPG performs amazingly while TRPO and PPO perform really poorly.
Switch it up again to the walker environment, well in the beginning DDPG seems to outperform TRPO but then later on is outperformed by TRPO. And that is just that because different algorithms tackle different issues of the RL problem.
So you don't have a silver bullet. But where the RL community is at the moment, there are lots and lots of heuristics for guiding your choice of which algorithm to use. For example, the code base for Intel's reinforcement learning coach had a heuristic with a graph that tells you which of the different algorithms are implemented.
And it asked you a few questions down below and then suggested: for these specific answers to the questions, you could use this agent or that agent. That code base is discontinued by now, but just this year there was a 40-pager on arXiv
that is basically just a bunch of if-else questions to help you figure out which RL algorithm to choose. And I'm not saying that these questions are not helpful in making your choice but it's very, very difficult to set up a meaningful heuristic that really generalizes across all environment types.
And for me the funny thing is: if you answer the very first question, about whether the environment dynamics are learnable or not, with yes, it is feasible to learn, then they basically tell you: oh, you're interested in data efficiency, look at model-based algorithms, which we do not cover in this heuristic. So basically, already after answering the first question, they tell you their heuristic is incomplete.
And the part from their discussion that I highlighted here, the yellow part really tells you there is no silver bullet, you really need to think about which algorithm you want to apply to which problem, so you have this algorithm selection problem.
And in the later part, the red part, they hint that you could actually use automation, because they say you should potentially ask a subset of these questions and do intermittent testing to verify whether the expected answer to each question is correct or not.
And if you were to automate that, you are basically optimizing the choice of which algorithm to use. Now that we have seen an example of one of the choices in the AutoRL problem, let's also talk about hyperparameter landscapes, because they are very different from what you might be used to in supervised learning.
And here I want to emphasize again that learning is done by iteration. And in the supervised learning setting, you start with an initial model, you fit the model to your data, you observe the model performance that gives you some loss based on which you adapt the model, and you repeat this iteration until your budget is exhausted, right?
If we break down reinforcement learning, it looks very, very similar. You initialize your policy, you generate some observations with that policy, you observe the policy performance that gets you a loss based on which you adapt your policy, and you repeat that until your learning budget is exhausted.
But there is one crucial difference here, right? On the left side, you have a fixed data set, and on the right side, you always generate new data. So not only are you working with an iterative problem, you're working with an iterative problem where you change the data on the fly all the time.
So this is highly non-stationary. And to just emphasize this even more, the typical RL loop is in this generalized policy iteration that is schematically shown here. So you have your policy, and you get a value function that is specific to that policy.
It is fit on the data that you collect with that policy. Based on the value function, which tells you whether a choice of an action in a state is good or not, you act greedily with respect to the value function of that policy to search for an even better performing policy. Then you use the new policy to generate completely new data and get a completely new value function that is not based on the same data as before.
So you really not only iterate through the parameter space of your model, but you really also iterate through the data that is being generated. And a very well-known example from the RL community is just the Q-learning example, where you don't need to know all the details here.
But basically what you do to update your Q function, the value function, you take your old estimate on the left side here, and you move it by a gradient.
And you then see a familiar hyperparameter, the learning rate, that tells you just how far you can follow this gradient. But there is another hyperparameter that is important for reinforcement learning, because it determines how far into the future you want to be able to predict the rewards that you get by following a certain policy.
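Written out, the Q-learning update just described looks like this, with learning rate $\alpha$ and discount factor $\gamma$ (the standard textbook form):

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big]
```

The learning rate $\alpha$ controls how far the estimate follows this gradient-like correction, and the discount factor $\gamma$ controls how strongly future rewards are taken into account.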
Now, the parameters that are being learned in the RL step are the model parameters of the value function, typically a deep neural network. And the hyperparameters, in this case, are learning rate and discounting factor. Of course, there are some more hyperparameters involved, but just for our
example here, this suffices to look at learning rate and discounting factor. And there was a very nice paper at last year's AutoML conference by Aditya, who is sitting in the back, which actually looked at the landscapes and analyzed the landscapes of this Q-learning update with a deep neural network in there.
And basically, what is being done is schematically shown in the bottom left. You run, or rather evaluate, your learning process with different configurations, and you branch off at specific points to see how performance improves or decreases when changing the hyperparameters after a certain time.
And in the beginning, if we look at the learning rate versus the discounting factor, it looks like we should take a very, very low learning rate and a very low discounting factor to be able to achieve the optimal returns.
After a while, though, we should have increased the gamma to be able to also better incorporate the future predictions. We don't need to act as greedy to the immediate rewards that we've seen, but we can actually start to optimize for the long-term gains. And in the end, we should have potentially decreased the gamma again a little bit, but start to increase the learning rate, right?
So you have these iterative landscapes that require that you do dynamic optimization rather than just the static optimization that is commonly done in the AutoML setting for supervised learning.
And just to reiterate: HPO for RL has to deal with much, much stronger non-stationarity than you are used to. So while we can use some approaches from the AutoML community to do hyperparameter optimization, there will be examples where they just do not work well because of this high non-stationarity.
But so far, we have only looked at selecting an algorithm and optimizing hyperparameters. But there's more to the RL pipeline than these two examples. For that, I want to mention that the next figures are from the survey on AutoRL that Teresa and I were involved in.
And what we saw there is that we have these learner tunables, which choose the algorithm, or potentially even learn the algorithm in a meta-learning style, along with the associated types of parameters.
But then there's also the actor level, right? The network that is, for example, parameterizing the policy and how it interacts with the environment. In the interest of time, I'm not going to talk about networks today, but rather about the environment because this is, again, something which is different to the supervised learning setting that you are most likely used to.
So what is the role of the environment here? For the environment, because we don't have fixed data, we have data generators, we could, for example, optimize the observation spaces. And that is maybe something that you are also used to in the supervised learning setting where you,
for example, use some cutout techniques or flipping the images and so on to get more robust models. And that's something that you could also do here to emphasize a generalization of your RL agent. But then there are the reward functions. Reward functions are not always optimal for the task that you're trying to solve, right?
So you could try to optimize the reward function so that it guides the reinforcement learning much better for the particular task, or it guides the particular instantiation of a reinforcement learning agent better to the particular task. Then you could also adapt the action spaces because sometimes you do not need all the actions in the potential action space,
or you could use some small policies as intermediate actions to help the agent explore more easily over a longer time horizon. And last but not least, you could also optimize the whole task curriculum, which consists of a combination of all of these things, where you first give the agent as simple a problem as possible so that it can learn
some initial successful policies, and then over time you try to increase the difficulty of your tasks in this curriculum. Right, so you have all these potential parts of optimizing elements of the environment to really help your RL agent.
And all these choices, again, depend on the algorithm that you're actually trying to optimize. So you might not need to do as much observation space tuning for a PPO agent as you might need to do for a DQN agent, or a different class of agent altogether. So while the environment at first looks like the given thing, there is lots and lots you can do
to actually tune your environment, to actually tune your algorithm and make the learning feasible and easy. And with that, I'm now going to hand over to Teresa, and she's going to talk to you about some success stories.
And we're actually going to start with environments as well. I have three short examples which are maybe not on the cutting edge of what AutoRL is doing right now, but they're fairly well known, and I think it's easier to understand the dynamics of how they work. So let's start with something that is really fairly well known in what is called environment design or curriculum learning in AutoRL: Prioritized Level Replay.
The idea is fairly simple. You have these two pools on the left; the first is sampled with some probability of picking something new. Those are levels or task variations; they just call them levels because they worked with Atari-like video games.
And with some low probability, maybe 10%, we sample from that and just take some random new thing we haven't seen before. And then below, we see that's not a uniform distribution anymore. That's a pool of levels we've seen before, and for which we already know we did well on them, or maybe we didn't. We've not played that one in a while.
And on these criteria — how well we did, how long we haven't seen a level, maybe even some extensions if you're creative, such as whether it is complementary to our last observations — we sample those. We sample from that pool 90% of the time, because we actually want to get better at things we're not great at yet. And how do we score that? We can actually just look at something like the loss.
So if we have fairly high loss signal, it's reasonable to assume that we can learn something from that level. So in that case, we want to play that level again, take another look, take another try until the loss decreases, and then maybe we have another priority somewhere. So it's quite a simple idea in a way. We just have these two buffers, we
replay things we're not great at yet, and from time to time we get new information. But it turns out it's a fairly successful way of doing task environment design, combining the tuning of the observation spaces, the reward spaces, and the curricula in one fell swoop, in an unsupervised way.
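The two-pool sampling scheme described above can be sketched roughly as follows. The priority used here, a simple mix of a learning-potential score and staleness, is a simplified stand-in for the rank-based scheme in the Prioritized Level Replay paper, and all names are illustrative:

```python
# Sketch of Prioritized Level Replay-style sampling. The priority (a mix of
# a learning-potential score and staleness) is a simplified stand-in for the
# ranked scheme in the paper; all names here are illustrative.
import random

def sample_level(seen, scores, last_played, step, p_new=0.1, rho=0.5):
    """Pick either an unseen level or a replayed one.

    seen: ids of levels played before; scores: learning potential per level
    (e.g. a recent loss); last_played: step at which each level was last used.
    """
    if not seen or random.random() < p_new:
        return "new"                     # sample some random new level
    # mix learning potential with staleness so old levels resurface
    def priority(level):
        return (1 - rho) * scores[level] + rho * (step - last_played[level])
    weights = [priority(level) for level in seen]
    return random.choices(seen, weights=weights, k=1)[0]
```

Levels with a high loss signal (much left to learn) or that have not been played for a while are replayed more often.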
So we also don't need to take care of a ton of hyperparameters on this level. Exactly, so this is a good example of the environment side. And you'll find this, I think, in a lot of work on environment design. But of course, you might be more familiar with hyperparameters. We've been talking a lot about hyperparameters this week.
Something else that is common here is dynamic hyperparameter optimization on the fly. The idea is simply that we find hyperparameter schedules, because as we've seen, we're in a non-stationary setting. We want them to be dynamic, but we want to be efficient: we want to spend few, if any, resources on actually evaluating our configurations.
So what we start with is we start with training, but we only do that partially. We want the schedule after all, so we only train for as long as we want our first schedule segment to be. Then we ask for a new configuration from some optimizer. Could be something, a Bayesian optimization, doesn't matter. We use that configuration to fit the existing samples we've collected in step one.
And then we approximate the cost of that configuration in some way. So we can try a few ones before we decide on one. And this approximation is fairly simple. So basically, for each sample we collected, we just ask ourselves, is this sample more likely now that we fit on the data again than it was before?
So basically: do we think we will now see this state more often? Do we take this action more often than before? And then, of course, we want to weight that by the quality of the sample. If we would now see a sample more often, but it came with a really bad reward, we obviously want to avoid that.
So that's a very simple strategy that gives us a natural dynamic configuration. An option which can fit the non-stationarity a lot better. Also, you see, 2019, it's actually been a while. This is still not that common, but it works fairly well.
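The likelihood-ratio idea described above — re-weighting already collected samples by how much more likely the new configuration's policy is to produce them, weighted by the return — can be sketched as a weighted importance-sampling estimate. The function names and the trajectory layout are assumptions for illustration, not the paper's actual interface:

```python
# Sketch of scoring a candidate configuration on already collected samples
# via weighted importance sampling. Function names and the trajectory layout
# are illustrative assumptions, not an actual library interface.
import math

def off_policy_value(trajectories, logp_new, logp_old):
    """Estimate the new policy's return without collecting new samples.

    trajectories: list of (steps, episode_return) pairs, where steps is a
    list of (state, action) tuples; logp_* give log pi(action | state).
    """
    weights, returns = [], []
    for steps, episode_return in trajectories:
        # how much more likely is this trajectory under the new policy?
        log_ratio = sum(logp_new(s, a) - logp_old(s, a) for s, a in steps)
        weights.append(math.exp(log_ratio))
        returns.append(episode_return)
    # weighted importance sampling: normalizing the weights reduces variance
    return sum(w * r for w, r in zip(weights, returns)) / sum(weights)
```

Candidate configurations can then be compared on this estimate alone, without spending any environment interactions on evaluating them.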
The alternative to this that you might have heard about that is way more common in reinforcement learning than it is in supervised learning is population-based training. Population-based training also tries to do something dynamic. So common theme keeps being there. And it tries to do so by being parallel. So we don't try to avoid the evaluation costs. In this case, we just do it in parallel on something like an HPC cluster
and hope that the actual runtime stays constant anyway. So basically, we start with the population of agents with different quality. We do some partial training again, see how their specific hyperparameter configuration improves or doesn't.
And then we select and mutate the hyperparameters. But that also means we can actually throw away the weights of the worst agents. I mean, they didn't learn anything. So we can just replace them with the good ones. And in the end, we have one great agent that has a schedule built in. And we could bootstrap from, in this case, four different individuals during the run.
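One exploit-and-explore round of population-based training can be sketched like this. The truncation selection and perturbation factor follow the commonly used recipe, but the exact values and the member layout here are simplified assumptions:

```python
# Sketch of one exploit-and-explore round of population-based training.
# Truncation selection and the perturbation factor follow the common recipe,
# but the values and member layout here are simplified assumptions.
import random

def pbt_step(population, partial_train, truncation=0.5, perturb=1.2):
    """Partially train all members, then copy-and-mutate into the worst ones.

    population: list of dicts with 'weights' and 'hyperparams'.
    partial_train: trains one member for a segment and returns its score.
    """
    ranked = sorted(population, key=partial_train, reverse=True)
    n_replace = int(len(ranked) * truncation)
    survivors, losers = ranked[:-n_replace], ranked[-n_replace:]
    for loser in losers:
        parent = random.choice(survivors)
        loser["weights"] = parent["weights"]       # exploit: copy a good agent
        loser["hyperparams"] = {                   # explore: perturb its config
            name: value * random.choice([1 / perturb, perturb])
            for name, value in parent["hyperparams"].items()
        }
    return ranked[0]                               # current best member
```

Repeating this segment after segment yields one final agent whose hyperparameters effectively followed a schedule, bootstrapped from the whole population.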
What these three have in common is that they're focused on dynamism. I think population-based training is actually the one with the least dynamic schedule in here, and usually you still have at least 10 segments where you switch up the hyperparameters or switch up something like the environment. That's a common theme. And another common theme that you might have observed is that a lot of this is simple — 'simple' is not quite the correct term, but compared to the AutoML tools we've heard about, which can be very sophisticated, with very many small moving parts and small improvements that go together, these are still very rough frameworks. They're not developed a lot yet.
And there's a good reason for that, and that is evaluation in AutoRL. AutoRL is pretty expensive. Those of you who have experience with reinforcement learning might know this: training a reinforcement learning agent can be a pain computationally. And AutoRL is, by logic, slightly worse.
What do I mean by slightly? Well, slightly enough that I think the biggest problem a lot of these methods have, and why they're not as common yet and why we have great papers from 2019 that don't see a lot of adoption, is that generalization of AutoRL methods is kind of lacking.
What do I mean by that? If we have our agent in our environment and fit an AutoRL method, find a great configuration, then, usually, you're probably used to that, we already see the next task that we want to solve. Slightly different, but probably similar.
Ideally, what we would like to see, and what we see in the algorithm configuration paradigm, is to be able to generalize the solution that we found to things that look similar. For example, the upper middle one — that environment and agent are in blue-green tones, just like our original one — should be similar enough, right?
The lower one, maybe not so much, but maybe we don't care. And being able to transfer this solution is crucial, because if we, for example, do this population-based training, spend four times as much as an already expensive run, then we also need to be able to reuse our result. And this is not quite so easy to check,
because evaluation practices in AutoRL are not well established yet. Oftentimes, we just look at the incumbent and say that's enough. You've probably heard a lot this week that that's usually not enough. So what do we actually evaluate? First, let's look at what we evaluate in RL itself. We already talked about rewards.
Usually, we care about something like the return: the sum of all the rewards we saw. Usually, we have a task that starts somewhere, and we also get an ending signal somewhere. So if I want my robot dog to fetch me a ball, I know the start is when I throw the ball, and the end is when it gets back to me, or when I get too bored and say stop.
But you already heard about the discount factor, and here the discount factor comes in. It's not just the sequence of rewards; it's the sequence of rewards weighted by how far in the future they are. And that is important. That is actually part of the task specification, because, thinking about this dog example, it says how I want the solution to look.
You can think of it like this: if I say the discount factor is 0.9, I want the dog to act as if there's a 10% chance at each step that I get too bored and stop. So it's incentivized to actually be quick about it and not just loiter around a lot. So we already have this temporal dependency, and that means we really need a sequence of these rewards.
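This weighted sum, the discounted return, is simple enough to write down directly. A minimal sketch, where the reward sequences are made up for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    # G = sum over t of gamma^t * r_t: rewards weighted by how far
    # in the future they arrive.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The dog that fetches quickly beats the dog that loiters, even though
# both eventually collect the same raw reward.
quick = discounted_return([0, 0, 1])       # ball back after 3 steps
slow = discounted_return([0] * 10 + [1])   # ball back after 11 steps
```

With gamma = 0.9, a reward ten steps away is worth only about a third of an immediate one, which is exactly the "10% chance per step of getting bored" reading of the discount.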
We can't just say, oh, yeah, let's do three steps, and we get an estimate of where we end up. So already, cost factor one. And then we encounter a big problem, and this is still in the RL loop, not in the auto-RL loop. Big problem is both parts involved here, the environment and the policy, can be stochastic,
and both can be highly stochastic. Example, let's say we have this super easy task. The agent can just move left or right. We try it once. The agent moves right, gets a reward of five. Great. Then we do a second trial. The agent goes right. Everything looks the same. The agent gets a reward of minus one. This is what happens in an environment that is stochastic.
It has a stochastic reward function. The same can also happen somewhat in reverse: our environment behaves the same, but our agent, instead of doing the great thing it has learned to do and has done before, just goes the other way. If we have a stochastic policy, that's possible.
So what that means is that we need to repeat these rollouts for evaluation, even just to get an estimate of the return. We need repetitions. Usually, we just average across multiple iterations of the task, and this is without training; this is just for evaluation. Which means we immediately get a trade-off
between extra cost and reliability of what we want to measure at the RL level. Also, some return distributions just look a bit crazy. They're sometimes very bimodal: you get either failure or success, and then your average can look very strange. Additionally, we have things like task variations,
as I said, and seeding is a big issue in reinforcement learning. So that means we need to do these multiple rollouts. We need to average somehow, and then we need to do that for each seed and each variation we try out. You can immediately see, A, we lose a bunch of information since our metrics are often interquartile means, and B, we need to do a lot of these rollouts.
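Put together, the evaluation loop looks roughly like this. A sketch with a made-up stochastic reward, just to show the repeat-and-aggregate pattern and an interquartile mean:

```python
import random
import statistics

def rollout(seed):
    # Stand-in for one evaluation episode: same policy, same action,
    # but a stochastic reward, as in the left/right example above.
    rng = random.Random(seed)
    return 5.0 if rng.random() < 0.7 else -1.0

def interquartile_mean(values):
    # Mean of the middle 50% of the values. Robust to bimodal returns,
    # but it discards the tails, which is part of the information we lose.
    v = sorted(values)
    k = len(v) // 4
    return statistics.mean(v[k:len(v) - k])

returns = [rollout(seed) for seed in range(20)]  # 20 rollouts just to evaluate
estimate = interquartile_mean(returns)
```

Note that all 20 rollouts here happen after training, purely to estimate performance; that is the trade-off between cost and reliability.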
A rollout is a direct environment interaction. If my environment really is "I want to walk my robot dog", that costs time and money. And even in purely computational terms, environment interaction is not free, right? The samples aren't just lying in a dataset on my server somewhere.
That is something that needs to be computed in some way, and some environments can be expensive. So let's go through this: say our RL evaluation cost is one. If we do this on the AutoRL level, what happens? First, we obviously need to evaluate our RL agent across seeds.
This is just proper practice, and RL across seeds can really look like success for one seed and failure for another with the same configuration. If we do it fewer than 10 times, I think there's no chance we get a good estimate of how good the RL agent actually is. So this is our target function: unstable, with unreliable estimates.
We have a stochastic auto-RL method oftentimes because if you remember what I talked about, we have optimizers in the loop, we have mutation strategies, we have some unsupervised system that samples, so we definitely need some seeds. Let's say it's quite reliable. We only need five seeds to test our auto-RL system.
Suddenly, we're at 50 times the evaluation cost without having actually done any optimization. And then, if we have different task variations, we need to do these 50 runs per task. So properly evaluating the AutoRL methods that I showed you is costly in itself, before even running them, before having trained a single agent.
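The cost arithmetic is worth writing out explicitly; the number of task variations is an assumption for illustration:

```python
rl_seeds = 10        # repetitions needed for one reliable RL estimate
autorl_seeds = 5     # repetitions of the stochastic AutoRL method itself
task_variations = 3  # illustrative; however many related tasks we care about

runs_per_task = rl_seeds * autorl_seeds       # already 50x a single run
total_runs = runs_per_task * task_variations  # before any actual optimization
```

And every one of those runs is itself a full (expensive) RL training, multiplied again by the evaluation rollouts discussed above.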
And that's maybe why AutoRL is not something we hear about a lot at summer schools or AutoML conferences yet: this cost has been prohibitive for a long time. We're slowly getting to a point where the implementations let us actually work with it much more easily.
But it is a significant cost to AutoRL research. So what's the solution then? Well, let's look at what's been happening in HPO. HPO, I think, is a good case study of where AutoRL is moving and how to maybe join that movement,
or what to do if you want to apply RL and wonder which AutoRL methods you can use. Because in AutoRL in general, HPO is a big topic. It's probably the biggest topic next to environment design. And it's also something that, as Andrea alluded to, is very important. So we already know that our algorithms
have many hyperparameters, and the algorithms are sensitive to many hyperparameters. So it's not just that there are between 10 and 20 hyperparameters you could configure. It's that usually you also need to configure some of them. I linked a paper we did last year
where we also have a lot of different plots where you can see that oftentimes we have really sharp drop-offs, and almost each hyperparameter can be responsible for the failure of an algorithm. But sensitivity and importance are different things altogether, right? A sensitive hyperparameter with a sensible default can still mean you don't need to tune it each time you apply the algorithm. An important hyperparameter, though, you usually do need to tune for a given environment. And we also looked at that, and it turns out, as luck would have it, those are highly domain dependent as well, even if we switch to something that looks fairly similar. So in this case, we looked at two environments.
These are both toy examples that are really simple tasks in the end. This Acrobot task is where you just swing up a pendulum. And the other one is in a 5x5 grid, find the goal. So they look similar, they're not complex. But even here, the four most important hyperparameters on the left side, they're distributed all over the place on the right.
And what you can also see, and this is very interesting in HPO for RL compared to supervised learning: the learning rate is among those four, right? It's somewhere in the middle on the left, and on the right, the learning rate is irrelevant. I'm not sure when you last saw that for a supervised learning problem. We haven't been able to identify a set of hyperparameters that always tends to matter.
There is no default advice of "just tune the learning rate and we'll be fine." That doesn't exist, or doesn't seem to exist just yet. So this is actually something we need to take care of. And if I asked you now, even with just those two facts,
you'll probably tell me, oh yeah, that sounds like we should just use an AutoML tool to do that, right? Obviously in practice, it doesn't look like that. In research and practice, it's often grid search that's used, unfortunately. Which leads to a lot of people having these issues. But it also has to be said, there's little systematic work on the automation of this,
partly due to the high computational cost. That's also why I'm not going to tell you to use tool X or Y to tune your reinforcement learning agents. I don't think we're at the point where we can confidently say what the best practice or best method is. But I think there are two great efforts
in benchmarking HPO for RL more reliably and more efficiently that can really show where this field is going in the next one or two years. And the first one is zero-cost benchmarking. And given that I've talked a lot about how expensive benchmarking is, that's probably good news.
So this is fairly fresh; it's going to be presented next week at the AutoML Conference, actually. It's a tabular benchmark for reinforcement learning called HPO-RL-Bench. Tabular benchmark means the team evaluated five algorithms on a total of 22 environments with a lot of configurations, and these are in a database, a lookup table in that repository, free to use.
So if you want to evaluate an AutoRL method and your search space matches, you can simply look up the value. So suddenly we're at zero-cost evaluation. This also supports dynamic configuration schedules, which I think is great, since a lot of AutoRL methods
are dynamic, and we have good evidence that this is really something we need to take care of in AutoRL as well. Plus, you get pre-evaluated baselines. I think that's at the bottom here: they evaluated things like random search, Optuna tools, SMAC, PBT, DyHPO. A lot of common methods are already in there.
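To make "tabular benchmark" concrete: all configurations are evaluated ahead of time, so querying one is a lookup. This is a hypothetical sketch using a plain dict, not the actual HPO-RL-Bench API; the algorithm names, environments, and values are made up:

```python
# Hypothetical lookup table: (algorithm, environment, sorted config) -> learning
# curve, pre-computed once so "evaluation" costs nothing at query time.
table = {
    ("PPO", "CartPole-v1", ("gamma", 0.99), ("lr", 1e-3)): [12.0, 180.5, 200.0],
    ("PPO", "CartPole-v1", ("gamma", 0.99), ("lr", 1e-4)): [9.5, 55.1, 130.2],
}

def evaluate(algorithm, env, **config):
    # Zero-cost "evaluation": the training already happened offline,
    # so querying a configuration is just a dictionary lookup.
    key = (algorithm, env) + tuple(sorted(config.items()))
    return table[key]

curve = evaluate("PPO", "CartPole-v1", lr=1e-3, gamma=0.99)
```

The catch discussed below follows directly from this design: if a configuration isn't a key in the table, there is nothing to look up.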
So this is a great place to get started, and you also get an orientation of what these results look like. The paper also has some great results on what hyperparameter importances look like across different RL domains, things like that. And it really shows, again, the importance of tuning and not just trusting intuition,
because even in supervised learning, intuition can fail, and in RL, things seem to be far more complex from what we've seen so far. Obviously, this work has the big limitation that all tabular benchmarks have: it's a table. So if you want to do anything that's not in the table, you're out of luck.
So there's another new benchmark, also from this year, that's technically not available yet: ARLBench. This is basically for when you, for some reason, cannot use HPO-RL-Bench, or when you want to look deeper into things like how seeding impacts the evaluation, or how evaluation can be made more efficient.
It is basically a very efficient implementation of three RL algorithms in JAX. JAX is great for RL. JAX is also very annoying the first time you use it; if you've worked with it before, you know what I mean.
It looks very different from standard Python. It also looks very different from Java or C or whatever you started with. But it can make a lot of machine learning tasks much faster, and in the case of RL, some people have gotten speedups of a thousand times over their baselines. What that means is that if it's packaged in a benchmark, you don't need to take care of the implementation yourself.
You don't need to learn the annoying language. You can simply use it. Additionally, we did a subset selection because we also started with 22 environments, and 22 environments are a lot, even if it's efficient to run. You have results for 22 different domains. What do you even do with that? Do you average across that?
What can you say about that? That's really hard to navigate. So we decided, both for speed and for simplicity, to select a subset of relevant environments, four to five for each algorithm, and that brings further speedups of 7 to 12 times, depending on what you actually run. And also, just like HPO-RL-Bench,
it has dynamic configuration, flexible objectives, large search spaces: all the things you want when looking into how optimizers actually compare. And that part, I thought, was very interesting, because we did see how optimizers compare. And yeah, those curves look really similar, don't they?
What are those? There's random search, so we know that one. SMAC, so Bayesian optimization. There's also SMAC with Hyperband, so Bayesian optimization with multi-fidelity built in. And PBT, which I don't know if many of you have run before; we can leave it at that for now.
For supervised learning, I would say it's an established fact that random search is not as good as the optimizers we have. I think there was at least a paper about a NeurIPS competition claiming that. I'm not sure if that's really true for all domains. But if we look either at the full set
for this algorithm, and actually all the other algorithms we tested as well, or at the subset, we see that in this case there is maybe a slight improvement for Bayesian optimization over random search. But if we use Hyperband, it goes away. For one of the algorithms it actually switched around, and the confidence bounds overlap so much
we can't actually say anything. That means RL is actually extremely interesting in terms of hyperparameter optimization because what happens if suddenly Bayesian optimization and a black box optimization is not the best thing we can do anymore? I think that's a really great sign that we can be very creative in this space
and look for especially dynamic solutions, but also just alternatives because I think in supervised learning, black box optimization is just so stable. It's so stable and we have this great body of research that makes it really efficient for the benchmarks we look at. And here we have a different beast altogether and it just seems to work differently.
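The "overlapping confidence bounds" point can be made concrete with a small bootstrap sketch. The per-seed returns here are made up for illustration, not results from the benchmark:

```python
import random
import statistics

def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    # Percentile bootstrap confidence interval for the mean final return.
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    return (means[int(alpha / 2 * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples) - 1])

# Made-up per-seed final returns for two optimizers on one environment.
random_search = [420, 455, 430, 490, 410, 465]
bayes_opt = [450, 445, 480, 430, 470, 440]
lo_rs, hi_rs = bootstrap_ci(random_search)
lo_bo, hi_bo = bootstrap_ci(bayes_opt)
overlap = lo_rs <= hi_bo and lo_bo <= hi_rs  # if True, we can't separate them
```

With intervals this wide and this much overlap, claiming one optimizer beats the other would not be justified, which is exactly the situation described above.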
And that means we can be really innovative with the approaches we tackle this with. And now I'd say we'll leave the rest of the time for your questions. Thank you.