End-to-End (small) Vision Language Model Fine-tuning Tutorial | On DGX Spark
Author: Daniel Bourke
Uploaded: 2026-01-16
Views: 768
In this video we fine-tune Hugging Face's SmolVLM2-500M Vision Language Model to do structured data extraction from images.
Because the SmolVLM2-500M model is quite small in the world of LLMs/VLMs, we're able to do all of the training locally on an NVIDIA DGX Spark (see here for more: https://nvda.ws/4iQXZU4).
The code should also run in Google Colab.
If you have any issues, please let me know in a comment.
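To give a taste of what the notebook covers: VLM fine-tuning data is typically arranged as chat-style messages pairing an image with a target text (here, JSON to extract). The sketch below shows a minimal version of that format; the prompt text and field names are assumptions for illustration, not taken from the actual dataset.

```python
# Hedged sketch of the chat-message format commonly used for VLM
# supervised fine-tuning (e.g. with TRL's SFTTrainer). The prompt and
# the example JSON fields are hypothetical placeholders.
def format_sample(image, label_json: str) -> dict:
    """Turn one (image, target JSON) pair into a chat-format sample."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": "Extract the nutrition fields as JSON."},
                ],
            },
            {
                "role": "assistant",
                # The target the model learns to produce.
                "content": [{"type": "text", "text": label_json}],
            },
        ]
    }

sample = format_sample(image=None, label_json='{"energy_kj": 1200}')
```

In training, each dataset row is mapped through a function like this, and the processor's chat template turns the messages into token IDs.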
Links:
Google Colab Notebook - https://colab.research.google.com/dri...
GitHub - https://github.com/mrdbourke/learn-hu...
Learn Hugging Face Book Version - https://www.learnhuggingface.com/note...
Dataset - https://huggingface.co/datasets/mrdbo...
Base model (SmolVLM2-500M) - https://huggingface.co/HuggingFaceTB/...
Demo - https://huggingface.co/spaces/mrdbour...
Livestreams (where I build this project from scratch):
Part 1: Creating a VLM dataset - https://www.youtube.com/live/cZVU559B...
Part 2: Fine-tuning a VLM with LoRA and QLoRA and getting many errors (mostly my fault) - https://www.youtube.com/live/Lgcp9hBq...
Part 3: Switching from using LoRA and QLoRA (we'll do these in a future video) to fine-tuning a smaller model (SmolVLM2) successfully, uploading it to the Hugging Face Hub and then creating and publishing a demo - https://youtube.com/live/cZVU559BLLM?...
Courses I teach:
Learn AI/ML (beginner-friendly course) - https://dbourke.link/ZTMMLcourse
Learn Hugging Face - https://dbourke.link/ZTMHuggingFace
Learn TensorFlow - https://dbourke.link/ZTMTFcourse
Learn PyTorch - https://dbourke.link/ZTMPyTorch
Connect elsewhere:
Download Nutrify (my startup) - https://apple.co/4ahM7Wc
My website - https://www.mrdbourke.com
X/Twitter - @mrdbourke
LinkedIn - www.linkedin.com/in/mrdbourke
Get email updates on my work - https://dbourke.link/newsletter
Read my novel Charlie Walks - https://www.charliewalks.com
Timestamps:
00:00:00 - Introduction
00:02:19 - What is a VLM?
00:03:45 - Why fine-tune your own model?
00:06:05 - LLM fine-tuning mindset
00:06:51 - Case study: Nutrify
00:09:16 - Case study: Invoice Extractor
00:11:06 - Ingredients and tools we're going to use
00:12:16 - Exploring the FoodExtract-1k-Vision dataset
00:15:52 - My setup
00:16:13 - Dataset formatting for VLMs
00:16:54 - Dataset Creation for VLMs
00:17:20 - Getting a model to fine-tune
00:18:13 - Our task overview (structured data extraction from images)
00:20:11 - What we're going to end up with
00:22:38 - Code Starts
00:23:31 - Viewing a single data sample
00:29:08 - Splitting our data into train and test sets
00:34:25 - Inspecting our model's architecture
00:40:03 - Reading the recipe of the SmolDocling paper
00:45:29 - Freezing the vision encoder in our model
00:47:34 - Discussing batch sizes
00:49:06 - Setting up SFTConfig
00:52:03 - Training our model with SFTTrainer
00:54:11 - Model training starts
00:54:19 - Model training finishes
00:56:13 - Inspecting our model's loss curves
00:57:10 - Uploading our trained model to Hugging Face
00:58:19 - Model uploading to Hugging Face begins
00:58:26 - Model uploading finishes
00:59:38 - Comparing the base model to the fine-tuned model
01:01:06 - Viewing our fine-tuned model's first predictions
01:03:35 - Creating a demo with Gradio
01:06:46 - Uploading our demo to the Hugging Face Hub
01:07:35 - Trying out our demo
01:08:27 - What's next and extensions
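One step from the timestamps above (freezing the vision encoder) boils down to switching off gradients for every parameter under the vision tower before training. The sketch below shows the selection logic on stand-in parameter objects; the prefix `model.vision_model` is an assumption, so check the real module names with `model.named_parameters()` for SmolVLM2.

```python
# Hedged sketch of freezing a vision encoder by parameter-name prefix.
# `Param` stands in for torch parameters; the prefix is hypothetical.
class Param:
    def __init__(self):
        self.requires_grad = True

def freeze_by_prefix(named_params: dict, prefix: str = "model.vision_model") -> int:
    """Set requires_grad=False on all params whose name starts with prefix."""
    frozen = 0
    for name, param in named_params.items():
        if name.startswith(prefix):
            param.requires_grad = False
            frozen += 1
    return frozen

params = {
    "model.vision_model.encoder.layers.0.weight": Param(),
    "model.text_model.layers.0.weight": Param(),
}
n_frozen = freeze_by_prefix(params)  # → 1 param frozen
```

With a real model you would iterate `model.named_parameters()` the same way; only the text-side parameters then receive gradient updates during SFT.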