Accelerating vLLM with LMCache | Ray Summit 2025
Author: Anyscale
Uploaded: 2025-11-19
Views: 596
At Ray Summit 2025, Kuntai Du from TensorMesh shares how LMCache expands the resource palette for serving large language models—making LLM inference faster and more cost-efficient by moving beyond GPU-only execution.
He begins by highlighting a key limitation in today’s serving stacks: KV-cache memory demands often exceed what GPUs alone can provide efficiently. LMCache addresses this by enabling KV-cache offloading to a wide range of datacenter resources—including CPU memory, local disk, and remote storage—and dynamically loading caches back to GPUs on demand. This unlocks new flexibility and dramatically reduces GPU memory pressure.
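For a sense of what this looks like from the serving side, here is a minimal sketch of launching vLLM with LMCache as its KV connector so that KV blocks can spill to CPU memory instead of being recomputed. The connector name, the LMCACHE_* environment variables, the model, and the CPU budget are assumptions drawn from public LMCache/vLLM integration examples rather than from the talk, and may differ across versions; check the LMCache documentation for the exact settings.

# Sketch: vLLM + LMCache with CPU-memory KV-cache offloading.
# Names of LMCACHE_* variables and the connector are assumptions; verify against the docs.
import os

os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # cache KV in 256-token chunks (assumed variable)
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable CPU-memory backend (assumed variable)
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "20"   # allow up to ~20 GB of CPU memory (assumed variable)

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route vLLM's KV transfer through the LMCache connector, which stores and
# reloads KV caches on the engine's behalf.
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # example model, not from the talk
    kv_transfer_config=ktc,
    gpu_memory_utilization=0.8,
)

long_context = open("shared_document.txt").read()   # hypothetical reused context
outputs = llm.generate(
    [long_context + "\n\nQuestion: summarize the document."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)

With this setup, KV blocks evicted from GPU memory land in the CPU pool and are pulled back on demand, which is the "offload and reload" behavior described above.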
But LMCache goes far beyond simple prefix caching. Kuntai introduces KV-cache–related machine learning techniques that allow the inference engine to:
Reuse KV caches for non-prefix text (see the sketch below)
Share and reuse caches across different LLMs
Improve inference efficiency even for complex, non-sequential workloads
These innovations enable faster inference, lower cost, and improved hardware utilization without modifying model architectures.
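As a rough intuition for the non-prefix reuse point above, and emphatically not LMCache's actual algorithm: classic prefix caching only hits when the cached text is an exact prefix of the new prompt, while chunk-level reuse keys cached KV segments by the content of each chunk, so a shared document appearing mid-prompt can still hit. The toy Python below shows only the lookup difference; real systems must also repair cross-chunk attention when stitching reused KV segments together, which is where the ML techniques in the talk come in.

# Toy comparison of prefix-keyed vs content-keyed (chunk) KV-cache lookup.
# Purely illustrative: content hashes stand in for stored KV tensors.
import hashlib

def chunks(text: str, size: int = 32):
    # Fixed-size character chunks; a real engine would chunk by tokens.
    return [text[i:i + size] for i in range(0, len(text), size)]

def key(chunk: str) -> str:
    return hashlib.sha256(chunk.encode()).hexdigest()

doc = "SHARED RAG DOCUMENT ... " * 8   # text reused across many prompts
prompt_a = doc + "Question: what is the refund policy?"
prompt_b = "System: be concise.\n" + doc + "Question: who is the author?"

# Prefix caching: prompt_b misses entirely because doc is no longer a prefix.
prefix_hit = prompt_b.startswith(doc)

# Content-keyed cache populated while serving prompt_a.
cache = {key(c) for c in chunks(prompt_a)}

# Looking up the document's chunks by content hash still hits for prompt_b,
# even though the document is not a prefix there.
reused = sum(key(c) in cache for c in chunks(doc))

print(f"prefix cache hit for prompt_b: {prefix_hit}")                      # False
print(f"doc chunks reusable via content keys: {reused}/{len(chunks(doc))}")  # 6/6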
Attendees will learn how LMCache opens new frontiers in LLM serving by leveraging broader datacenter resources and smart KV-cache reuse strategies—delivering scalable performance improvements even for the largest models.
Subscribe to our YouTube channel to stay up-to-date on the future of AI! / anyscale
🔗 Connect with us:
LinkedIn: / joinanyscale
X: https://x.com/anyscalecompute
Website: https://www.anyscale.com/