Dissecting PaliGemma: Building a Vision-Language Model from Scratch in MLX
Author: Josef Albers
Uploaded: 2025-12-25
Views: 34
In this video, we dive deep into the architecture of PaliGemma, Google’s open vision-language model, by walking through a custom implementation using Apple’s MLX framework. We dissect the code to understand how image features and text embeddings are fused to generate captions.
What we cover in this code walkthrough:
The Model Architecture: We look at the `PGemmaModel` class, which serves as the container for three distinct components: the `VisionModel`, the `LanguageModel`, and the `Projector` (a structural sketch follows this list).
Vision Encoder: We explore how the model processes images using `VisionEmbeddings`, utilizing a standard convolutional layer (`Conv2d`) to create patch embeddings from the input image.
The "Glue" (Projector): See how the `Projector` class uses a linear layer to map the vision features into the same dimension as the text embeddings, allowing the language model to "understand" the image.
Weight Loading & Sanitization: We walk through the logic required to download the pre-trained weights from Hugging Face (`google/paligemma-3b-mix-224`) and manually map specific keys, such as renaming `vision_tower.vision_model`, to fit our MLX structure (see the loading sketch below).
Multi-Modal Assembly: We break down the custom `assemble` function, which merges the input text embeddings with the image features and applies 4D attention masks to handle the different modalities (sketched below).
Inference Loop: Finally, we run a generation loop that predicts tokens one by one using `mx.argmax` until an end-of-sequence token is reached, outputting a caption for a sample image (see the decoding sketch below).
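
For orientation, here is a minimal structural sketch of how the three components described above could be wired together in MLX. The class names follow the walkthrough; the layer names, the `LanguageModel` stub, and the dimensions (SigLIP hidden size 1152, Gemma hidden size 2048, 14x14 patches on a 224x224 image) are illustrative assumptions, not the repository's exact code.

```python
import mlx.core as mx
import mlx.nn as nn


class VisionEmbeddings(nn.Module):
    """Turns an image into a sequence of patch embeddings with a Conv2d."""

    def __init__(self, hidden_size=1152, patch_size=14, image_size=224):
        super().__init__()
        self.patch_embedding = nn.Conv2d(
            in_channels=3,
            out_channels=hidden_size,
            kernel_size=patch_size,
            stride=patch_size,
        )
        num_patches = (image_size // patch_size) ** 2  # 256 patches for 224/14
        self.position_embedding = nn.Embedding(num_patches, hidden_size)

    def __call__(self, pixel_values):
        # MLX convolutions expect NHWC input: (batch, 224, 224, 3).
        x = self.patch_embedding(pixel_values)           # (B, 16, 16, 1152)
        B, H, W, C = x.shape
        x = x.reshape(B, H * W, C)                       # (B, 256, 1152)
        return x + self.position_embedding(mx.arange(H * W))


class VisionModel(nn.Module):
    """Vision encoder; the SigLIP transformer layers are omitted in this sketch."""

    def __init__(self):
        super().__init__()
        self.embeddings = VisionEmbeddings()
        # ... encoder layers would go here ...

    def __call__(self, pixel_values):
        return self.embeddings(pixel_values)


class Projector(nn.Module):
    """Maps vision features into the language model's embedding space."""

    def __init__(self, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.linear = nn.Linear(vision_dim, text_dim)

    def __call__(self, image_features):
        return self.linear(image_features)


class LanguageModel(nn.Module):
    """Gemma decoder stub: only the token embedding table is shown here."""

    def __init__(self, vocab_size=257152, hidden_size=2048):  # vocab size illustrative
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)


class PGemmaModel(nn.Module):
    """Container tying the three components together."""

    def __init__(self):
        super().__init__()
        self.vision_model = VisionModel()
        self.language_model = LanguageModel()
        self.projector = Projector()
```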
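
Next, a hedged sketch of the weight download and sanitization step. Only the `vision_tower.vision_model` rename comes from the walkthrough; the `multi_modal_projector` rename, the convolution-weight transpose, and the helper name `load_weights` are assumptions about what a typical MLX port needs.

```python
from pathlib import Path

import mlx.core as mx
from huggingface_hub import snapshot_download
from mlx.utils import tree_unflatten


def load_weights(model, repo_id="google/paligemma-3b-mix-224"):
    # Download only the safetensors shards from the Hub.
    model_dir = Path(snapshot_download(repo_id, allow_patterns=["*.safetensors"]))

    weights = {}
    for shard in model_dir.glob("*.safetensors"):
        weights.update(mx.load(str(shard)))

    # Sanitize: rename Hugging Face keys so they match our MLX module tree.
    sanitized = {}
    for key, value in weights.items():
        key = key.replace("vision_tower.vision_model.", "vision_model.")
        key = key.replace("multi_modal_projector.", "projector.")  # assumed mapping
        # MLX Conv2d stores weights as (out, H, W, in) rather than (out, in, H, W).
        if "patch_embedding.weight" in key and value.ndim == 4:
            value = value.transpose(0, 2, 3, 1)
        sanitized[key] = value

    model.update(tree_unflatten(list(sanitized.items())))
    return model
```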
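
The multi-modal assembly can be illustrated as follows: the projected image features are prepended to the text embeddings, and a 4D additive mask lets the image-plus-prompt prefix attend bidirectionally while the rest stays causal. The function signature and masking details here are a simplified assumption, not the repository's exact logic.

```python
import mlx.core as mx


def assemble(image_features, text_embeds, num_prefix_text_tokens):
    # image_features: (B, N_img, D), already passed through the projector
    # text_embeds:    (B, N_txt, D), from the language model's embedding table
    inputs_embeds = mx.concatenate([image_features, text_embeds], axis=1)
    B, L, _ = inputs_embeds.shape

    prefix_len = image_features.shape[1] + num_prefix_text_tokens

    # Start from a causal mask ...
    positions = mx.arange(L)
    causal = positions[None, :] <= positions[:, None]        # (L, L) bool
    # ... then let every position see the full (image + prompt) prefix.
    in_prefix = positions[None, :] < prefix_len               # (1, L) bool
    allowed = mx.logical_or(causal, in_prefix)                # (L, L) bool

    # Convert to an additive 4D mask of shape (B, 1, L, L).
    mask = mx.where(allowed, 0.0, float("-inf"))
    mask = mx.broadcast_to(mask[None, None, :, :], (B, 1, L, L))
    return inputs_embeds, mask
```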
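
Finally, a sketch of the greedy decoding loop: take `mx.argmax` over the last position's logits and stop at the end-of-sequence token. The model call signature, `embed_tokens`, and the tokenizer interface are assumptions; a real loop would use a KV cache instead of re-running the full sequence at every step.

```python
import mlx.core as mx


def prefix_lm_mask(seq_len, prefix_len):
    # Additive mask: bidirectional over the image + prompt prefix, causal after.
    pos = mx.arange(seq_len)
    allowed = mx.logical_or(pos[None, :] <= pos[:, None], pos[None, :] < prefix_len)
    return mx.where(allowed, 0.0, float("-inf"))[None, None, :, :]


def generate(model, inputs_embeds, prefix_len, tokenizer, max_tokens=100):
    token_ids = []
    for _ in range(max_tokens):
        L = inputs_embeds.shape[1]
        mask = prefix_lm_mask(L, prefix_len)                  # (1, 1, L, L)
        logits = model.language_model(inputs_embeds, mask)    # assumed call signature
        next_token = mx.argmax(logits[:, -1, :], axis=-1)     # greedy pick, shape (B,)
        if next_token.item() == tokenizer.eos_token_id:       # assumes batch size 1
            break
        token_ids.append(next_token.item())
        # Embed the predicted token and append it to the running sequence.
        new_embed = model.language_model.embed_tokens(next_token)[:, None, :]
        inputs_embeds = mx.concatenate([inputs_embeds, new_embed], axis=1)
    return tokenizer.decode(token_ids)
```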
Full Source Code: https://github.com/JosefAlbers/Phi-3-Visio...