vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Woosuk Kwon & Xiaoxuan Liu, UC Berkeley

Author: PyTorch

Uploaded: 2024-10-01

Views: 10808

Description:

We will present vLLM, an open-source, high-performance LLM inference engine built on top of PyTorch. Starting as a research project at UC Berkeley, vLLM has become one of the fastest and most popular LLM inference solutions in industry, reaching 20K+ stars and 350+ contributors. In this talk, we will cover how vLLM adopts various LLM inference optimizations and how it supports various AI accelerators such as AMD GPUs, Google TPUs, and AWS Inferentia. We will also discuss how vLLM benefits from PyTorch 2 and its ecosystem.

Related videos

Understanding the LLM Inference Workload - Mark Moyou, NVIDIA

CUDA Mode Keynote | Lily Liu | vLLM

Lightning Talk: The Fastest Path to Production: PyTorch Inference in Python - Mark Saroufim, Meta

Simon Mo on vLLM: Easy, Fast, and Cost-Effective LLM Serving for Everyone

KV Cache: The Trick That Makes LLMs Faster

The State of vLLM | Ray Summit 2024

vLLM: Easily Deploying & Serving LLMs

Fast LLM Serving with vLLM and PagedAttention

vLLM on Kubernetes in Production

Most developers don't understand how LLM tokens work.

Large Language Models Explained Briefly

Distributed ML Talk @ UC Berkeley

GraphRAG: The Marriage of Knowledge Graphs and RAG: Emil Eifrem

Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

Enabling Cost-Efficient LLM Serving with Ray Serve

vLLM: Easy, Fast, and Cheap LLM Serving, Woosuk Kwon, UC Berkeley

NVIDIA Triton Inference Server and its use in Netflix's Model Scoring Service

[vLLM Office Hours #26] Intro to torch.compile and how it works with vLLM

Run A Local LLM Across Multiple Computers! (vLLM Distributed Inference)

What is vLLM? Efficient AI Inference for Large Language Models
