Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines... M. Kaushik, S.K. Merla
Author: CNCF [Cloud Native Computing Foundation]
Uploaded: 2024-11-16
Views: 2011
Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon Europe in London from April 1 - 4, 2025. Connect with our current graduated, incubating, and sandbox projects as the community gathers to further the education and advancement of cloud native computing. Learn more at https://kubecon.io
Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines on K8s - Meenakshi Kaushik & Shiva Krishna Merla, NVIDIA
In this session, we'll cover best practices for deploying, scaling, and managing LLM inference pipelines on Kubernetes (K8s). We'll explore common patterns like inference, retrieval-augmented generation (RAG), and fine-tuning. Key challenges addressed include:
[1] Minimizing initial inference latency with model caching (a rough sketch follows this list)
[2] Optimizing GPU usage with efficient scheduling, multi-GPU/node handling, and auto-quantization
[3] Enhancing security and management with RBAC, monitoring, auto-scaling, and support for air-gapped clusters
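As a rough illustration of the model-caching idea in [1], the sketch below pre-populates a PersistentVolumeClaim with model weights using a one-off Job, so inference pods can mount the cached copy instead of downloading it on first start. The namespace, claim name, downloader image, and model ID are illustrative assumptions, not the exact setup shown in the session.

```python
# Sketch: pre-cache LLM weights in a PVC so the first inference avoids a cold download.
# Namespace, claim name, image, and model ID below are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
namespace = "llm-serving"

# 1. A PVC that will hold the cached model weights.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "llm-model-cache"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "50Gi"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim(namespace, pvc)

# 2. A one-off Job that downloads the weights into the PVC.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "cache-llm-weights"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "downloader",
                    "image": "python:3.11",  # assumed downloader image
                    "command": [
                        "sh", "-c",
                        "pip install huggingface_hub && "
                        "huggingface-cli download meta-llama/Llama-3.1-8B-Instruct "
                        "--local-dir /cache/llama-3.1-8b",
                    ],
                    "volumeMounts": [{"name": "cache", "mountPath": "/cache"}],
                }],
                "volumes": [{
                    "name": "cache",
                    "persistentVolumeClaim": {"claimName": "llm-model-cache"},
                }],
            }
        }
    },
}
client.BatchV1Api().create_namespaced_job(namespace, job)
```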
We'll also demonstrate building customizable pipelines for inference, RAG, and fine-tuning, and managing them post-deployment. Solutions include:
[1] a lightweight standalone tool built using the operator pattern, and
[2] KServe, a robust open-source AI inference platform.
This session will equip you to effectively manage LLM inference pipelines on K8s, improving performance, efficiency, and security.
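To make solution [2] more concrete, here is a hedged sketch of what such a deployment on KServe can look like: a single InferenceService that mounts the cached weights from the PVC above, requests a GPU, and sets replica bounds for autoscaling. The resource names, namespace, and the `huggingface` model format (which needs a matching KServe ServingRuntime installed) are assumptions for illustration, not the exact configuration from the talk.

```python
# Sketch: a KServe InferenceService serving the cached model with GPU scheduling
# and autoscaling bounds. Assumes KServe and a matching ServingRuntime are installed;
# all names are illustrative.
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-3-1-8b", "namespace": "llm-serving"},
    "spec": {
        "predictor": {
            "minReplicas": 1,   # keep one warm replica to avoid cold starts
            "maxReplicas": 4,   # scale out under load
            "model": {
                "modelFormat": {"name": "huggingface"},  # assumed serving runtime
                "storageUri": "pvc://llm-model-cache/llama-3.1-8b",  # cached weights
                "resources": {
                    "requests": {"cpu": "4", "memory": "24Gi", "nvidia.com/gpu": "1"},
                    "limits": {"nvidia.com/gpu": "1"},
                },
            },
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="llm-serving",
    plural="inferenceservices",
    body=inference_service,
)
```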