Dissecting PaliGemma: Building a Vision-Language Model from Scratch in MLX
Author: Josef Albers
Uploaded: 2025-12-25
Views: 34
In this video, we dive deep into the architecture of PaliGemma, Google’s open vision-language model, by walking through a custom implementation using Apple’s MLX framework. We dissect the code to understand how image features and text embeddings are fused to generate captions.
What we cover in this code walkthrough:
The Model Architecture: We look at the `PGemmaModel` class, which serves as the container for three distinct components: the `VisionModel`, the `LanguageModel`, and the `Projector` (a structural sketch follows this list).
Vision Encoder: We explore how the model processes images using `VisionEmbeddings`, utilizing a standard convolutional layer (`Conv2d`) to create patch embeddings from the input image.
The "Glue" (Projector): See how the `Projector` class uses a linear layer to map the vision features into the same dimension as the text embeddings, allowing the language model to "understand" the image.
Weight Loading & Sanitization: We walk through the logic required to download the pre-trained weights from Hugging Face (`google/paligemma-3b-mix-224`) and manually map specific keys, such as renaming `vision_tower.vision_model`, to fit our MLX structure (see the loading sketch below).
Multi-Modal Assembly: We break down the custom `assemble` function, which merges the input text embeddings with the image features and applies 4D attention masks to handle the different modalities (sketched below).
Inference Loop: Finally, we run a generation loop that predicts tokens one by one using `mx.argmax` until an end-of-sequence token is reached, outputting a caption for a sample image (see the decoding sketch below).
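
For orientation, here is a minimal structural sketch of how the three components described above could be wired together in MLX. The class names follow the walkthrough; the layer names, the `LanguageModel` stub, and the dimensions (SigLIP hidden size 1152, Gemma hidden size 2048, 14x14 patches on a 224x224 image) are illustrative assumptions, not the repository's exact code.

```python
import mlx.core as mx
import mlx.nn as nn


class VisionEmbeddings(nn.Module):
    """Turns an image into a sequence of patch embeddings with a Conv2d."""

    def __init__(self, hidden_size=1152, patch_size=14, image_size=224):
        super().__init__()
        self.patch_embedding = nn.Conv2d(
            in_channels=3,
            out_channels=hidden_size,
            kernel_size=patch_size,
            stride=patch_size,
        )
        num_patches = (image_size // patch_size) ** 2  # 256 patches for 224/14
        self.position_embedding = nn.Embedding(num_patches, hidden_size)

    def __call__(self, pixel_values):
        # MLX convolutions expect NHWC input: (batch, 224, 224, 3).
        x = self.patch_embedding(pixel_values)           # (B, 16, 16, 1152)
        B, H, W, C = x.shape
        x = x.reshape(B, H * W, C)                       # (B, 256, 1152)
        return x + self.position_embedding(mx.arange(H * W))


class VisionModel(nn.Module):
    """Vision encoder; the SigLIP transformer layers are omitted in this sketch."""

    def __init__(self):
        super().__init__()
        self.embeddings = VisionEmbeddings()
        # ... encoder layers would go here ...

    def __call__(self, pixel_values):
        return self.embeddings(pixel_values)


class Projector(nn.Module):
    """Maps vision features into the language model's embedding space."""

    def __init__(self, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.linear = nn.Linear(vision_dim, text_dim)

    def __call__(self, image_features):
        return self.linear(image_features)


class LanguageModel(nn.Module):
    """Gemma decoder stub: only the token embedding table is shown here."""

    def __init__(self, vocab_size=257152, hidden_size=2048):  # vocab size illustrative
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)


class PGemmaModel(nn.Module):
    """Container tying the three components together."""

    def __init__(self):
        super().__init__()
        self.vision_model = VisionModel()
        self.language_model = LanguageModel()
        self.projector = Projector()
```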
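
Next, a hedged sketch of the weight download and sanitization step. Only the `vision_tower.vision_model` rename comes from the walkthrough; the `multi_modal_projector` rename, the convolution-weight transpose, and the helper name `load_weights` are assumptions about what a typical MLX port needs.

```python
from pathlib import Path

import mlx.core as mx
from huggingface_hub import snapshot_download
from mlx.utils import tree_unflatten


def load_weights(model, repo_id="google/paligemma-3b-mix-224"):
    # Download only the safetensors shards from the Hub.
    model_dir = Path(snapshot_download(repo_id, allow_patterns=["*.safetensors"]))

    weights = {}
    for shard in model_dir.glob("*.safetensors"):
        weights.update(mx.load(str(shard)))

    # Sanitize: rename Hugging Face keys so they match our MLX module tree.
    sanitized = {}
    for key, value in weights.items():
        key = key.replace("vision_tower.vision_model.", "vision_model.")
        key = key.replace("multi_modal_projector.", "projector.")  # assumed mapping
        # MLX Conv2d stores weights as (out, H, W, in) rather than (out, in, H, W).
        if "patch_embedding.weight" in key and value.ndim == 4:
            value = value.transpose(0, 2, 3, 1)
        sanitized[key] = value

    model.update(tree_unflatten(list(sanitized.items())))
    return model
```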
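
The multi-modal assembly can be illustrated as follows: the projected image features are prepended to the text embeddings, and a 4D additive mask lets the image-plus-prompt prefix attend bidirectionally while the rest stays causal. The function signature and masking details here are a simplified assumption, not the repository's exact logic.

```python
import mlx.core as mx


def assemble(image_features, text_embeds, num_prefix_text_tokens):
    # image_features: (B, N_img, D), already passed through the projector
    # text_embeds:    (B, N_txt, D), from the language model's embedding table
    inputs_embeds = mx.concatenate([image_features, text_embeds], axis=1)
    B, L, _ = inputs_embeds.shape

    prefix_len = image_features.shape[1] + num_prefix_text_tokens

    # Start from a causal mask ...
    positions = mx.arange(L)
    causal = positions[None, :] <= positions[:, None]        # (L, L) bool
    # ... then let every position see the full (image + prompt) prefix.
    in_prefix = positions[None, :] < prefix_len               # (1, L) bool
    allowed = mx.logical_or(causal, in_prefix)                # (L, L) bool

    # Convert to an additive 4D mask of shape (B, 1, L, L).
    mask = mx.where(allowed, 0.0, float("-inf"))
    mask = mx.broadcast_to(mask[None, None, :, :], (B, 1, L, L))
    return inputs_embeds, mask
```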
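
Finally, a sketch of the greedy decoding loop: take `mx.argmax` over the last position's logits and stop at the end-of-sequence token. The model call signature, `embed_tokens`, and the tokenizer interface are assumptions; a real loop would use a KV cache instead of re-running the full sequence at every step.

```python
import mlx.core as mx


def prefix_lm_mask(seq_len, prefix_len):
    # Additive mask: bidirectional over the image + prompt prefix, causal after.
    pos = mx.arange(seq_len)
    allowed = mx.logical_or(pos[None, :] <= pos[:, None], pos[None, :] < prefix_len)
    return mx.where(allowed, 0.0, float("-inf"))[None, None, :, :]


def generate(model, inputs_embeds, prefix_len, tokenizer, max_tokens=100):
    token_ids = []
    for _ in range(max_tokens):
        L = inputs_embeds.shape[1]
        mask = prefix_lm_mask(L, prefix_len)                  # (1, 1, L, L)
        logits = model.language_model(inputs_embeds, mask)    # assumed call signature
        next_token = mx.argmax(logits[:, -1, :], axis=-1)     # greedy pick, shape (B,)
        if next_token.item() == tokenizer.eos_token_id:       # assumes batch size 1
            break
        token_ids.append(next_token.item())
        # Embed the predicted token and append it to the running sequence.
        new_embed = model.language_model.embed_tokens(next_token)[:, None, :]
        inputs_embeds = mx.concatenate([inputs_embeds, new_embed], axis=1)
    return tokenizer.decode(token_ids)
```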
Full Source Code: https://github.com/JosefAlbers/Phi-3-Visio...