Speeding Up AI Quantization Techniques for Models and Vector DBs
Author: Weaviate vector database
Uploaded: 2025-03-26
Views: 444
In this talk, Marcin Antas (/antasmarcin), a senior Core Engineer who has been at @Weaviate for over 4 years, breaks down the essential techniques for optimizing AI models through quantization.
Learn how to significantly reduce the memory footprint of large language models and embedding models while preserving their functionality, even on constrained devices like the Raspberry Pi 5!
🔑 Key Topics Covered:
LLM quantization techniques (from FP16/FP8 to 4-bit precision)
The GGUF format and the llama.cpp framework
Why feed-forward layer parameters are more sensitive to quantization than attention-layer parameters
Embedding model quantization using ONNX
Vector database quantization methods (Product, Binary, and Scalar)
Running vector databases and AI models on edge devices
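The vector quantization methods listed above can be sketched in a few lines of NumPy. This is an illustrative example of scalar (int8) and binary (1-bit) quantization of embedding vectors, not Weaviate's actual implementation; the array shapes and the min-max calibration are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy embedding matrix: 4 vectors, 8 dimensions (real embeddings are far larger).
vectors = rng.standard_normal((4, 8)).astype(np.float32)

# Scalar quantization: map each float32 value to a uint8 bucket (4x smaller).
lo, hi = float(vectors.min()), float(vectors.max())
scale = (hi - lo) / 255.0
sq = np.round((vectors - lo) / scale).astype(np.uint8)
# Dequantize to approximate the originals; error is bounded by the bucket size.
dequantized = sq.astype(np.float32) * scale + lo

# Binary quantization: keep only the sign of each dimension (1 bit per dim).
bq = (vectors > 0).astype(np.uint8)

def hamming(a, b):
    # Distance between binary codes: number of differing bits.
    return int(np.count_nonzero(a != b))
```

Scalar quantization trades a small, bounded reconstruction error for a 4x memory saving, while binary quantization is far more aggressive (32x) and relies on cheap Hamming distances for candidate search, typically followed by rescoring with the original vectors.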
This technical deep dive is perfect for developers looking to optimize AI models for memory-constrained environments or deploy vector search capabilities on edge devices.
Learn more from Weaviate at https://weaviate.io.
