Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines... M. Kaushik, S.K. Merla
Author: CNCF [Cloud Native Computing Foundation]
Uploaded: 2024-11-16
Views: 2011
Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon Europe in London from April 1 - 4, 2025. Connect with our current graduated, incubating, and sandbox projects as the community gathers to further the education and advancement of cloud native computing. Learn more at https://kubecon.io
Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines on K8s - Meenakshi Kaushik & Shiva Krishna Merla, NVIDIA
In this session, we'll cover best practices for deploying, scaling, and managing LLM inference pipelines on Kubernetes (K8s). We'll explore common patterns like inference, retrieval-augmented generation (RAG), and fine-tuning. Key challenges addressed include:
[1] Minimizing initial inference latency with model caching (a rough sketch follows this list)
[2] Optimizing GPU usage with efficient scheduling, multi-GPU/node handling, and auto-quantization
[3] Enhancing security and management with RBAC, monitoring, auto-scaling, and support for air-gapped clusters
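As a rough illustration of the model-caching idea in [1], the sketch below pre-populates a PersistentVolumeClaim with model weights using a one-off Job, so inference pods can mount the cached copy instead of downloading it on first start. The namespace, claim name, downloader image, and model ID are illustrative assumptions, not the exact setup shown in the session.

```python
# Sketch: pre-cache LLM weights in a PVC so the first inference avoids a cold download.
# Namespace, claim name, image, and model ID below are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
namespace = "llm-serving"

# 1. A PVC that will hold the cached model weights.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "llm-model-cache"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "50Gi"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim(namespace, pvc)

# 2. A one-off Job that downloads the weights into the PVC.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "cache-llm-weights"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "downloader",
                    "image": "python:3.11",  # assumed downloader image
                    "command": [
                        "sh", "-c",
                        "pip install huggingface_hub && "
                        "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct "
                        "--local-dir /cache/llama-3.1-8b",
                    ],
                    "volumeMounts": [{"name": "cache", "mountPath": "/cache"}],
                }],
                "volumes": [{
                    "name": "cache",
                    "persistentVolumeClaim": {"claimName": "llm-model-cache"},
                }],
            }
        }
    },
}
client.BatchV1Api().create_namespaced_job(namespace, job)
```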
We'll also demonstrate building customizable pipelines for inference, RAG, and fine-tuning, and managing them post-deployment. Solutions include:
[1] a lightweight standalone tool built using the operator pattern, and
[2] KServe, a robust open-source AI inference platform.
This session will equip you to effectively manage LLM inference pipelines on K8s, improving performance, efficiency, and security.
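To make solution [2] more concrete, here is a hedged sketch of what such a deployment on KServe can look like: a single InferenceService that mounts the cached weights from the PVC above, requests a GPU, and sets replica bounds for autoscaling. The resource names, namespace, and the `huggingface` model format (which needs a matching KServe ServingRuntime installed) are assumptions for illustration, not the exact configuration from the talk.

```python
# Sketch: a KServe InferenceService serving the cached model with GPU scheduling
# and autoscaling bounds. Assumes KServe and a matching ServingRuntime are installed;
# all names are illustrative.
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-3-1-8b", "namespace": "llm-serving"},
    "spec": {
        "predictor": {
            "minReplicas": 1,   # keep one warm replica to avoid cold starts
            "maxReplicas": 4,   # scale out under load
            "model": {
                "modelFormat": {"name": "huggingface"},  # assumed serving runtime
                "storageUri": "pvc://llm-model-cache/llama-3.1-8b",  # cached weights
                "resources": {
                    "requests": {"cpu": "4", "memory": "24Gi", "nvidia.com/gpu": "1"},
                    "limits": {"nvidia.com/gpu": "1"},
                },
            },
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="llm-serving",
    plural="inferenceservices",
    body=inference_service,
)
```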