Part 1 of 3 — Proximal Policy Optimization Implementation: 11 Core Implementation Details

Author: Weights & Biases

Uploaded: 2021-09-10

Views: 56695

Description:

Proximal Policy Optimization (PPO) is one of the most popular reinforcement learning algorithms and works across a variety of domains, from robotics control to Atari games to chip design.

In this video, we dive deep into 11 core implementation details of PPO and build the algorithm from scratch in PyTorch, step-by-step.

---

Source code: https://github.com/vwxyzjn/ppo-implem...
Related blog post: https://iclr-blog-track.github.io/202...
Background music: Flutes Will Chill — https://artlist.io/song/48722/flutes-...

---

0:00 Introduction
2:01 Dev environment
2:19 Common variables
3:18 Tensorboard
4:02 Weights and Biases
6:05 1. Vector environment
9:53 Agent setup
10:13 2. Layer initialization
11:48 3. Adam's epsilon
12:15 Training loop
15:36 4. Learning rate annealing
17:15 5. Generalized Advantage Estimation
18:49 6. Minibatch update
20:22 7. Advantage normalization
20:45 8. Clipped objective
21:07 9. Value loss clipping
21:32 10. Entropy loss
22:12 11. Global gradient clipping
22:30 Debug variables
23:10 Bonus. Early stopping
24:17 Visualize training on W&B
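
As a rough illustration of detail 5 in the chapter list above (Generalized Advantage Estimation), here is a minimal pure-Python sketch. It is not the code from the video, which builds PPO in PyTorch. The function name `compute_gae`, its parameters, the default values of `gamma` and `gae_lambda`, and the convention that `dones[t]` marks an episode ending at step `t` are all assumptions made for this example.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout of a single environment.

    Assumed conventions (hypothetical, chosen for this sketch):
      rewards[t], values[t]: reward received and value estimate at step t
      dones[t]: True if the episode ended at step t
      last_value: value estimate for the state after the final step
    """
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    last_advantage = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(num_steps)):
        next_value = last_value if t == num_steps - 1 else values[t + 1]
        # Do not bootstrap or propagate advantages across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors.
        last_advantage = delta + gamma * gae_lambda * not_done * last_advantage
        advantages[t] = last_advantage
    # Value targets are the advantages added back onto the value estimates.
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = compute_gae(
        rewards=[1.0, 1.0, 1.0],
        values=[0.0, 0.0, 0.0],
        dones=[False, False, True],
        last_value=5.0,
        gamma=1.0,
        gae_lambda=1.0,
    )
    print(adv)  # [3.0, 2.0, 1.0]
```

With `gamma=1.0`, `gae_lambda=1.0`, and zero value estimates, the advantages reduce to the undiscounted reward-to-go, and `last_value` is ignored because the final step ends the episode. Setting `gae_lambda=0.0` would instead reduce each advantage to its one-step TD error.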
