Reinforcement learning - lecture 2



Centre International de Rencontres Mathématiques (CIRM)

Lazaric, Allesandro

Formal Metadata

Title

Reinforcement learning - lecture 2

Title of Series

Mathematics, Signal Processing and Learning, 2021

Number of Parts

Author

Lazaric, Allesandro

License

CC Attribution - NonCommercial - NoDerivatives 2.0 Generic:
You may use, copy, distribute, and make publicly available the work or its content in unchanged form for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify.

Identifiers

10.24350/CIRM.V.19705003 (DOI)

Publisher

Centre International de Rencontres Mathématiques (CIRM)

Publication Year

2021

Language

English

Content Metadata

Subject Area

Computer Science, Mathematics

Genre

Workshop/Interactive Format, Lecture

Abstract

Reinforcement learning (RL) studies the problem of learning how to optimally control a dynamical and stochastic environment. Unlike in supervised learning, an RL agent does not receive direct supervision on which actions to take in order to maximize the long-term reward, and it needs to learn from the samples collected through direct interaction with the environment. RL algorithms combined with deep learning tools recently achieved impressive results in a variety of problems ranging from recommendation systems to computer games, often reaching human-competitive performance (e.g., in the game of Go). In this course, we will review the mathematical foundations of RL and the most popular algorithmic strategies. In particular, we will build around the model of Markov decision processes (MDPs) to formalize the agent-environment interaction and ground RL algorithms in popular dynamic programming algorithms, such as value and policy iteration. We will study how such algorithms can be made online and incremental, and how to integrate approximation techniques from the deep learning literature. Finally, we will discuss the problem of the exploration-exploitation dilemma in the simpler bandit scenario as well as in the full RL case. Throughout the course, we will try to identify the main current limitations of RL algorithms and the main open questions in the field.
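
As a rough illustration of the dynamic programming algorithms named in the abstract, the following is a minimal sketch of value iteration on a tabular, discounted MDP. It is not taken from the lecture: the function name `value_iteration`, the array layout (`P[a, s, s']` for transitions, `R[s, a]` for expected rewards), the discount factor, and the two-state toy MDP are all assumptions chosen for this example.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration for a finite discounted MDP (illustrative sketch).

    P: array of shape (n_actions, n_states, n_states), P[a, s, t] is the
       probability of moving from state s to state t under action a.
    R: array of shape (n_states, n_actions), expected immediate reward.
    Returns the approximate optimal value function and a greedy policy.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Bellman optimality backup:
        # Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the final value estimate.
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return V, Q.argmax(axis=1)


# Hypothetical toy MDP: two states, two actions.
# Action 0 stays in the current state, action 1 switches state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
])
# Reward 1 only for staying in state 1, otherwise 0.
R = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
])

V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)
```

In this toy problem the greedy policy switches out of state 0 and stays in state 1, and the values approach 9 and 10 respectively, since staying in state 1 forever yields 1 / (1 - 0.9) = 10.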