NVIDIA DYNAMO: Serving LLMs at AI-Factory Scale
Author: AIFoundry Org
Uploaded: 2025-12-14
Views: 27
On October 25th in San Francisco, we got together to discuss “What’s missing in an open-source full-stack AI platform?”
The AI Plumbers Unconference: San Francisco Edition is an open-source meetup for builders of low-level AI systems to dive into the plumbing of modern AI, from data infrastructure to AI accelerators.
Watch the #AIPlumbers presentation by the NVIDIA team on Dynamo: a deep dive into production inference at scale, where both compute and memory demands are growing exponentially.
Disaggregated serving, intelligent scheduling, multi-tier memory management, KV-routing, and high-availability mechanics — all designed to push inference efficiency to the maximum.
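As a rough sketch of the KV-routing idea (conceptual only, not Dynamo’s actual API; the block size, load weight, and Worker shape here are all hypothetical), a router can send each request to the worker that already holds the longest cached prefix of the prompt, discounted by that worker’s current load:

```python
# Minimal sketch of KV-cache-aware routing (conceptual, not Dynamo's API).
# Each prefix-block hash covers the whole prefix up to that block, so a
# match implies every earlier block matched too.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)

@dataclass
class Worker:
    name: str
    cached_blocks: set = field(default_factory=set)  # hashes of KV blocks held
    active_requests: int = 0

def block_hashes(tokens):
    """Hash the prompt block by block over its full blocks only."""
    hashes, prefix = [], ()
    full_blocks = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full_blocks, BLOCK_SIZE):
        prefix += tuple(tokens[i:i + BLOCK_SIZE])
        hashes.append(hash(prefix))
    return hashes

def route(workers, tokens, load_weight=2.0):
    """Pick the worker maximizing cache reuse minus a load penalty."""
    hashes = block_hashes(tokens)
    def score(w):
        reused = sum(1 for h in hashes if h in w.cached_blocks)
        return reused - load_weight * w.active_requests
    return max(workers, key=score)

# Example: worker "a" already served a prompt sharing the first 32 tokens.
prompt = list(range(48))
a = Worker("a", cached_blocks=set(block_hashes(prompt[:32])))
b = Worker("b")
print(route([a, b], prompt).name)  # -> "a"
```

The real KV router tracks cache state published by the workers themselves; the point is simply that routing on prefix overlap avoids recomputing prefill for shared prompts.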
This #AIPlumbers talk showcased production-grade engineering: offline performance configurators that find optimal cluster layouts, dynamic K8s scheduling that understands physical GPU topology, coordinated multi-GPU serving, and more. Lots of clever tricks for handling compute-bound vs memory-bound workloads; I’d heard people discuss these before, but here they were put into practice, not just theory. And it’s all #opensource.
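To make the compute-bound vs memory-bound split concrete, here is a back-of-envelope roofline check (a sketch, not from the talk; the hardware numbers are roughly an H100 SXM and the 70B dense model is a hypothetical example):

```python
# Back-of-envelope roofline check: why prefill is compute-bound and
# small-batch decode is memory-bound. Numbers are illustrative.
PEAK_TFLOPS = 990          # BF16 tensor-core peak, TFLOP/s (~H100 SXM)
HBM_BW_TBPS = 3.35         # HBM bandwidth, TB/s
RIDGE = PEAK_TFLOPS / HBM_BW_TBPS  # ~296 FLOPs/byte: the roofline ridge point

PARAMS = 70e9              # hypothetical 70B-parameter dense model
BYTES_PER_PARAM = 2        # BF16 weights

def arithmetic_intensity(batch_tokens):
    """FLOPs per byte of weights read for one forward pass over
    `batch_tokens` tokens (~2 FLOPs per parameter per token)."""
    flops = 2 * PARAMS * batch_tokens
    bytes_read = PARAMS * BYTES_PER_PARAM
    return flops / bytes_read  # simplifies to batch_tokens for BF16

for tokens, phase in [(4096, "prefill, one 4k prompt"),
                      (1, "decode, batch of 1"),
                      (256, "decode, batch of 256")]:
    ai = arithmetic_intensity(tokens)
    bound = "compute-bound" if ai > RIDGE else "memory-bound"
    print(f"{phase:24s} intensity ~ {ai:6.0f} FLOPs/byte -> {bound}")
```

Prefill over a long prompt sails past the ridge point and saturates the tensor cores, while small-batch decode is starved for HBM bandwidth (and this sketch even ignores KV-cache traffic, which makes decode worse). That asymmetry is exactly why disaggregating the two phases and giving them different batch shapes and hardware budgets pays off.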
We also really hope to hear more from the Dynamo team at #FOSDEM26 - don’t miss it!
Key moments from the talk:
00:00 – 01:02 — Dynamo: Inference at Scale
01:03 – 02:49 — Inference Compute Requirements Scaling Exponentially
02:50 – 05:59 — Dynamo: A Systematic Approach to AI Inference at Scale
06:00 – 08:54 — Memory Management
08:55 – 12:19 — KV Router
12:20 – 15:00 — Production-Grade Serving with Dynamo
15:01 – 16:33 — Offline Perf Configurator
16:34 – 18:39 — Offline Perf Optimizer
18:40 – 26:00 — Topology-Optimized Dynamic K8s Scheduling
26:01 – 29:22 — Fault Tolerance
29:23 – 32:32 — How Dynamo Works