"Near-optimal Regret in Online MDPs with Aggregate Bandit Feedback" - Tal Lancewicki, Research TTIC
Author: TTIC
Uploaded: Mar 18, 2025
Views: 35
Near-optimal Regret in Online MDPs with Aggregate Bandit Feedback
Description:
Tal Lancewicki explores the challenge of learning in online Markov decision processes (MDPs) with aggregate bandit feedback, where the agent observes only the total loss of a trajectory rather than the loss at each step. He reviews existing approaches and introduces a new policy optimization algorithm to improve learning efficiency in this setting, with applications in AI, robotics, and reinforcement learning.
Tal Lancewicki, Tel Aviv University
Originally recorded on March 10, 2025.
Timestamps:
00:00 – Intro
00:55 – Talk
48:48 – Conclusion and Q&A
Tags:
#AI #MachineLearning #ReinforcementLearning #MDP #BanditAlgorithms #PolicyOptimization #Robotics #ComputationalTheory #ResearchTalk
