NVIDIA DYNAMO: Serving LLMs at AI-Factory Scale
Author: AIFoundry Org
Uploaded: 2025-12-14
Views: 27
On October 25th in San Francisco, we got together to discuss “What’s missing in an open-source full-stack AI platform?”
The AI Plumbers Unconference: San Francisco Edition is an open-source meetup for builders of low-level AI systems to dive into the plumbing of modern AI, from data infrastructure to AI accelerators.
Watch the #AIPlumbers presentation by the NVIDIA team on Dynamo: a deep dive into production inference at scale, where both compute and memory demands are growing exponentially.
Disaggregated serving, intelligent scheduling, multi-tier memory management, KV-routing, and high-availability mechanics — all designed to push inference efficiency to the maximum.
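As a rough sketch of the KV-routing idea (conceptual only, not Dynamo’s actual API; the block size, load weight, and Worker shape here are all hypothetical), a router can send each request to the worker that already holds the longest cached prefix of the prompt, discounted by that worker’s current load:

```python
# Minimal sketch of KV-cache-aware routing (conceptual, not Dynamo's API).
# Each prefix-block hash covers the whole prefix up to that block, so a
# match implies every earlier block matched too.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)

@dataclass
class Worker:
    name: str
    cached_blocks: set = field(default_factory=set)  # hashes of KV blocks held
    active_requests: int = 0

def block_hashes(tokens):
    """Hash the prompt block by block over its full blocks only."""
    hashes, prefix = [], ()
    full_blocks = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full_blocks, BLOCK_SIZE):
        prefix += tuple(tokens[i:i + BLOCK_SIZE])
        hashes.append(hash(prefix))
    return hashes

def route(workers, tokens, load_weight=2.0):
    """Pick the worker maximizing cache reuse minus a load penalty."""
    hashes = block_hashes(tokens)
    def score(w):
        reused = sum(1 for h in hashes if h in w.cached_blocks)
        return reused - load_weight * w.active_requests
    return max(workers, key=score)

# Example: worker "a" already served a prompt sharing the first 32 tokens.
prompt = list(range(48))
a = Worker("a", cached_blocks=set(block_hashes(prompt[:32])))
b = Worker("b")
print(route([a, b], prompt).name)  # -> "a"
```

The real KV router tracks cache state published by the workers themselves; the point is simply that routing on prefix overlap avoids recomputing prefill for shared prompts.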
This #AIPlumbers talk showcased production-grade engineering: offline performance configurators that find optimal cluster layouts, dynamic K8s scheduling that understands physical GPU topology, coordinated multi-GPU serving, and more. Lots of clever tricks for handling compute-bound vs memory-bound workloads; I’d heard people discuss these before, but here they were put into practice, not just theory. And it’s all #opensource.
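To make the compute-bound vs memory-bound split concrete, here is a back-of-envelope roofline check (a sketch, not from the talk; the hardware numbers are roughly an H100 SXM and the 70B dense model is a hypothetical example):

```python
# Back-of-envelope roofline check: why prefill is compute-bound and
# small-batch decode is memory-bound. Numbers are illustrative.
PEAK_TFLOPS = 990          # BF16 tensor-core peak, TFLOP/s (~H100 SXM)
HBM_BW_TBPS = 3.35         # HBM bandwidth, TB/s
RIDGE = PEAK_TFLOPS / HBM_BW_TBPS  # ~296 FLOPs/byte: the roofline ridge point

PARAMS = 70e9              # hypothetical 70B-parameter dense model
BYTES_PER_PARAM = 2        # BF16 weights

def arithmetic_intensity(batch_tokens):
    """FLOPs per byte of weights read for one forward pass over
    `batch_tokens` tokens (~2 FLOPs per parameter per token)."""
    flops = 2 * PARAMS * batch_tokens
    bytes_read = PARAMS * BYTES_PER_PARAM
    return flops / bytes_read  # simplifies to batch_tokens for BF16

for tokens, phase in [(4096, "prefill, one 4k prompt"),
                      (1, "decode, batch of 1"),
                      (256, "decode, batch of 256")]:
    ai = arithmetic_intensity(tokens)
    bound = "compute-bound" if ai > RIDGE else "memory-bound"
    print(f"{phase:24s} intensity ~ {ai:6.0f} FLOPs/byte -> {bound}")
```

Prefill over a long prompt sails past the ridge point and saturates the tensor cores, while small-batch decode is starved for HBM bandwidth (and this sketch even ignores KV-cache traffic, which makes decode worse). That asymmetry is exactly why disaggregating the two phases and giving them different batch shapes and hardware budgets pays off.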
We also really hope to hear more from the Dynamo team at #FOSDEM26 - don’t miss it!
Key moments from the talk:
00:00 – 01:02 — Dynamo: Inference at Scale
01:03 – 02:49 — Inference Compute Requirements Scaling Exponentially
02:50 – 05:59 — Dynamo: A Systematic Approach to AI Inference at Scale
06:00 – 08:54 — Memory Management
08:55 – 12:19 — KV Router
12:20 – 15:00 — Production-Grade Serving with Dynamo
15:01 – 16:33 — Offline Perf Configurator
16:34 – 18:39 — Offline Perf Optimizer
18:40 – 26:00 — Topology-Optimized Dynamic K8s Scheduling
26:01 – 29:22 — Fault Tolerance
29:23 – 32:32 — How Dynamo Works