End-to-End (small) Vision Language Model Fine-tuning Tutorial | On DGX Spark

Автор: Daniel Bourke

Загружено: 2026-01-16

Просмотров: 768

Описание:

In this video we fine-tune Hugging Face's SmolVLM2-500M Vision Language Model do structured data extraction from images.

Because the SmolVLM2-500M model is is quite small in world of LLMs/VLMs, we're able to do all of the training locally on a NVIDIA DGX Spark (see here for more: https://nvda.ws/4iQXZU4).

The code should also run in Google Colab.

If you have any issues, please let me know in a comment.

Links:
Google Colab Notebook - https://colab.research.google.com/dri...
GitHub - https://github.com/mrdbourke/learn-hu...
Learn Hugging Face Book Version - https://www.learnhuggingface.com/note...
Dataset - https://huggingface.co/datasets/mrdbo...
Base model (SmolVLM2-500M) - https://huggingface.co/HuggingFaceTB/...
Demo - https://huggingface.co/spaces/mrdbour...

Livestreams (where I build this project from scratch):
Part 1: Creating a VLM dataset - https://www.youtube.com/live/cZVU559B...
Part 2: Fine-tuning a VLM with LoRA and QLoRA and getting many errors (mostly my fault) - https://www.youtube.com/live/Lgcp9hBq...
Part 3: Switching from using LoRA and QLoRA (we’ll do these in a future video) to fine-tuning a smaller model (SmolVLM2) successfully, uploading it to the Hugging Face Hub and then creating an publishing a demo - https://youtube.com/live/cZVU559BLLM?...

Courses I teach:
Learn AI/ML (beginner-friendly course) - https://dbourke.link/ZTMMLcourse
Learn Hugging Face - https://dbourke.link/ZTMHuggingFace
Learn TensorFlow - https://dbourke.link/ZTMTFcourse
Learn PyTorch - https://dbourke.link/ZTMPyTorch

Connect elsewhere:
Download Nutrify (my startup) - https://apple.co/4ahM7Wc
My website - https://www.mrdbourke.com
X/Twitter - / mrdbourke
LinkedIn - www.linkedin.com/in/mrdbourke
Get email updates on my work - https://dbourke.link/newsletter
Read my novel Charlie Walks - https://www.charliewalks.com

Timestamps:
00:00:00 - Introduction
00:02:19 - What is a VLM?
00:03:45 - Why fine-tune your own model?
00:06:05 - LLM fine-tuning mindset
00:06:51 - Case study: Nutrify
00:09:16 - Case study: Invoice Extractor
00:11:06 - Ingredients and tools we're going to use
00:12:16 - Exploring the FoodExtract-1k-Vision dataset
00:15:52 - My setup
00:16:13 - Dataset formatting for VLMs
00:16:54 - Dataset Creation for VLMs
00:17:20 - Getting a model to fine-tune
00:18:13 - Our task overview (structured data extraction from images)
00:20:11 - What we're going to end up with
00:22:38 - Code Starts
00:23:31 - Viewing a single data sample
00:29:08 - Splitting our data into train and test sets
00:34:25 - Inspecting our model's architecture
00:40:03 - Reading the recipe of the SmolDocling paper
00:45:29 - Freezing the vision encoder in our model
00:47:34 - Discussing batch sizes
00:49:06 - Setting up SFTConfig
00:52:03 - Training our model with SFTTrainer
00:54:11 - Model training starts
00:54:19 - Model training finishes
00:56:13 - Inspecting our model's loss curves
00:57:10 - Uploading our trained model to Hugging Face
00:58:19 - Model uploading to Hugging Face begins
00:58:26 - Model uploading finishes
00:59:38 - Comparing the base model to the fine-tuned model
01:01:06 - Viewing our fine-tuned model's first predictions
01:03:35 - Creating a demo with Gradio
01:06:46 - Uploading our demo to the Hugging Face Hub
01:07:35 - Trying out our demo
01:08:27 - What's next and extensions

End-to-End (small) Vision Language Model Fine-tuning Tutorial | On DGX Spark

Доступные форматы для скачивания:

Скачать видео mp4

Информация по загрузке:

Скачать аудио mp3

Похожие видео

Распаковка, настройка и первые впечатления от NVIDIA DGX Spark — One plug AI.

Распаковка, настройка и первые впечатления от NVIDIA DGX Spark — One plug AI.

Доработайте свою степень магистра права за 13 минут. Вот как

Доработайте свою степень магистра права за 13 минут. Вот как

Microsoft begs for mercy

Microsoft begs for mercy

Она мастер спорта по боксу! Как тренируются лучшие девушки боксеры

Она мастер спорта по боксу! Как тренируются лучшие девушки боксеры

On evaluating LLMs: Let the errors emerge from the data | AI & ML Monthly

On evaluating LLMs: Let the errors emerge from the data | AI & ML Monthly

LLM fine-tuning или ОБУЧЕНИЕ малой модели? Мы проверили!

LLM fine-tuning или ОБУЧЕНИЕ малой модели? Мы проверили!

Клод Код вот-вот всё сломает

Клод Код вот-вот всё сломает

Кто пишет код лучше всех? Сравнил GPT‑5.2, Opus 4.5, Sonnet 4.5, Gemini 3, Qwen 3 Max, Kimi, GLM

Кто пишет код лучше всех? Сравнил GPT‑5.2, Opus 4.5, Sonnet 4.5, Gemini 3, Qwen 3 Max, Kimi, GLM

Цепи Маркова — математика предсказаний [Veritasium]

Цепи Маркова — математика предсказаний [Veritasium]

Teach LLM Something New 💡 LoRA Fine Tuning on Custom Data

Teach LLM Something New 💡 LoRA Fine Tuning on Custom Data

Превратите ЛЮБОЙ файл в знания LLM за СЕКУНДЫ

Превратите ЛЮБОЙ файл в знания LLM за СЕКУНДЫ

Почему RAG терпит неудачу — как CLaRa устраняет свой главный недостаток

Почему RAG терпит неудачу — как CLaRa устраняет свой главный недостаток

EASIEST Way to Fine-Tune a LLM and Use It With Ollama

EASIEST Way to Fine-Tune a LLM and Use It With Ollama

Теренс Тао о том, как Григорий Перельман решил гипотезу Пуанкаре | Лекс Фридман

Теренс Тао о том, как Григорий Перельман решил гипотезу Пуанкаре | Лекс Фридман

Николай ПЛАТОШКИН | Кто придумал “еврокоммунизм” и ЗАЧЕМ?

Николай ПЛАТОШКИН | Кто придумал “еврокоммунизм” и ЗАЧЕМ?

Как Сделать Настольный ЭЛЕКТРОЭРОЗИОННЫЙ Станок?

Как Сделать Настольный ЭЛЕКТРОЭРОЗИОННЫЙ Станок?

NVIDIA's new cuML framework speeds up Scikit-Learn by 50x | AI & ML Monthly

NVIDIA's new cuML framework speeds up Scikit-Learn by 50x | AI & ML Monthly

Fast Fine Tuning with Unsloth

Fast Fine Tuning with Unsloth

Самая сложная модель из тех, что мы реально понимаем

Самая сложная модель из тех, что мы реально понимаем

What does it take to build a Realistic RAG in 2025? | AI & ML Monthly

What does it take to build a Realistic RAG in 2025? | AI & ML Monthly