Dynamic Scheduling for Large Language Model Serving | Ray Summit 2024
Author: Anyscale
Uploaded: Oct 21, 2024
Views: 468
Hanyu Zhao from Alibaba Group presents Llumnix, a dynamic request scheduling system for large language models, at Ray Summit 2024. Built on vLLM and Ray, Llumnix addresses key challenges in LLM serving through innovative runtime rescheduling and KV cache migration across instances.
Zhao discusses how Llumnix reduces prefill latencies through cross-instance defragmentation and minimizes tail decoding latencies by balancing loads and reducing preemptions. The talk covers the research journey behind Llumnix, from its origins to its publication at OSDI '24, and its subsequent deployment and evolution at Alibaba.
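The core idea of load-balancing via live KV cache migration can be illustrated with a toy scheduler. The sketch below is an assumption-laden simplification, not Llumnix's actual implementation: instance names, the block-based load unit, and the `rebalance` heuristic are all hypothetical, and real migration must transfer KV cache tensors across GPUs rather than update a dict.

```python
# Toy sketch (NOT the real Llumnix scheduler): greedily migrate requests'
# KV caches from the most loaded instance to the least loaded one until
# the load gap falls under a threshold, reducing preemption pressure.
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    # Per-request KV cache footprint in GPU blocks (hypothetical unit).
    requests: dict = field(default_factory=dict)

    @property
    def load(self) -> int:
        return sum(self.requests.values())


def rebalance(instances: list[Instance], threshold: int = 2):
    """Return a list of (request_id, src, dst) migration decisions."""
    migrations = []
    while True:
        src = max(instances, key=lambda i: i.load)
        dst = min(instances, key=lambda i: i.load)
        gap = src.load - dst.load
        if gap <= threshold or not src.requests:
            break
        # Migrate the smallest request first: cheapest KV cache to move.
        req_id = min(src.requests, key=src.requests.get)
        blocks = src.requests[req_id]
        if 2 * blocks > gap:  # moving it would not shrink the imbalance
            break
        del src.requests[req_id]
        dst.requests[req_id] = blocks
        migrations.append((req_id, src.name, dst.name))
    return migrations
```

For example, with instance A holding requests of 4, 3, and 3 blocks and instance B holding one of 1 block, a threshold of 2 migrates a single 3-block request from A to B and then stops, since any further move would overshoot.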
The presentation provides insights into the current state of Llumnix and outlines future development plans. Zhao also highlights the open-source nature of the project, available on GitHub, encouraging community engagement and collaboration.
This session offers valuable information for those interested in optimizing LLM serving, particularly in large-scale, high-performance environments. It demonstrates practical applications of Ray and vLLM in addressing complex scheduling challenges in AI infrastructure.
--
Interested in more?
Watch the full Day 1 Keynote: • Ray Summit 2024 Keynote Day 1 | Where Buil...
Watch the full Day 2 Keynote: • Ray Summit 2024 Keynote Day 2 | Where Buil...
--
🔗 Connect with us:
Subscribe to our YouTube channel: / @anyscale
Twitter: https://x.com/anyscalecompute
LinkedIn: / joinanyscale
Website: https://www.anyscale.com
