Nvidia NEW Multimodal LLM - Describe Anything Model (DAM)
Author: DSAI by Dr. Osbert Tay
Uploaded: 2025-04-23
Views: 358
🚀 We’re excited to introduce the Describe Anything Model (DAM) — a powerful Multimodal LLM that generates detailed descriptions for user-defined regions in images and videos using points, boxes, scribbles, or masks.
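For illustration only, here is one way the different region prompts could be normalized into a single binary mask before captioning. This is a generic sketch, not DAM's actual API; the `prompt` dictionary format and field names are assumptions.

```python
import numpy as np

def region_to_mask(prompt: dict, height: int, width: int) -> np.ndarray:
    """Turn a point, box, scribble, or mask prompt into a boolean region mask (illustrative)."""
    mask = np.zeros((height, width), dtype=bool)
    if prompt["type"] == "mask":
        mask |= prompt["mask"].astype(bool)        # already a dense mask
    elif prompt["type"] == "box":                  # (x0, y0, x1, y1) in pixels
        x0, y0, x1, y1 = prompt["box"]
        mask[y0:y1, x0:x1] = True
    elif prompt["type"] in ("point", "scribble"):  # list of (x, y) pixel coordinates
        for x, y in prompt["points"]:
            mask[y, x] = True
    return mask
```

In practice, sparse prompts such as points or scribbles are typically refined into a full segmentation mask by a separate segmentation model before the region is described.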
🔗 We open-source the code, models, demo, data, and benchmark for research purposes at: https://describe-anything.github.io/
💡 DAM powers the Detailed Localized Captioning (DLC) task, which goes beyond standard captioning. Instead of summarizing the whole scene, DLC focuses on specific regions — capturing fine details like texture, color, shape, and distinctive features.
📽️ DLC naturally extends to videos, describing how a region’s appearance and context evolve over time.
🔍 Key to this is our Focal Prompt mechanism, which shows the model both the full image and a zoomed-in view of the target area. This enables detailed and context-aware captioning.
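As a rough sketch (not the released implementation), the focal view can be thought of as a padded crop around the region's bounding box; the `context` padding ratio below is an assumption.

```python
import numpy as np
from PIL import Image

def focal_prompt(image: Image.Image, mask: np.ndarray, context: float = 0.5):
    """Return the full image plus a zoomed-in crop around the (non-empty) masked region."""
    ys, xs = np.nonzero(mask)                        # pixels inside the region
    x0, x1 = int(xs.min()), int(xs.max())
    y0, y1 = int(ys.min()), int(ys.max())
    # Expand the box by a context margin so surrounding pixels stay visible.
    pad_x, pad_y = int((x1 - x0) * context), int((y1 - y0) * context)
    box = (max(0, x0 - pad_x), max(0, y0 - pad_y),
           min(image.width, x1 + pad_x), min(image.height, y1 + pad_y))
    return image, image.crop(box)                    # both views feed the model
```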
🌟 Under the hood, we use a localized vision backbone that combines global and focal features, with gated cross-attention layers aligning and fusing the two streams.
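A minimal sketch of such a gated cross-attention block, assuming a tanh-gated residual in the style of prior gated cross-attention designs; the layer sizes and zero-initialized gate are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Focal tokens attend to global tokens; a learned gate scales the fused signal."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # starts closed; opens during training

    def forward(self, focal_tokens: torch.Tensor, global_tokens: torch.Tensor) -> torch.Tensor:
        # Focal features query the global context; the gated residual fuses both views.
        attended, _ = self.attn(self.norm(focal_tokens), global_tokens, global_tokens)
        return focal_tokens + torch.tanh(self.gate) * attended
```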
📊 Data matters. Since existing datasets lack localized detail, we built a two-stage data pipeline:
1. Use a VLM to expand class and mask labels into rich, localized descriptions.
2. Apply self-training to generate and refine captions on unlabeled data.
This scalable method lets us build a high-quality training set without heavy human annotation (a sketch of the pipeline follows).
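An illustrative sketch of that two-stage pipeline: the callables `expand_with_vlm` and `self_caption` are hypothetical stand-ins for the actual models, and the confidence threshold is an assumption, not a reported value.

```python
from typing import Any, Callable, Iterable, List, Tuple

def build_training_set(
    labeled: Iterable[Tuple[Any, Any, str]],                 # (image, mask, class_name)
    unlabeled: Iterable[Tuple[Any, Any]],                    # (image, mask)
    expand_with_vlm: Callable[[Any, Any, str], str],         # stage 1: label -> rich description
    self_caption: Callable[[Any, Any], Tuple[str, float]],   # stage 2: caption + confidence
    min_confidence: float = 0.8,
) -> List[Tuple[Any, Any, str]]:
    examples = []
    # Stage 1: expand existing class and mask labels into localized descriptions.
    for image, mask, class_name in labeled:
        examples.append((image, mask, expand_with_vlm(image, mask, class_name)))
    # Stage 2: self-training, keeping only confidently captioned unlabeled regions.
    for image, mask in unlabeled:
        caption, confidence = self_caption(image, mask)
        if confidence >= min_confidence:
            examples.append((image, mask, caption))
    return examples
```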
📏 Our benchmark, DLC-Bench, features an LLM-based judge that evaluates region-based descriptions for accuracy, rewarding correct details and penalizing incorrect claims.
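For intuition, a judge query might look like the sketch below; the prompt wording, the +1/-1 scoring, and the attribute lists are assumptions, not DLC-Bench's actual protocol.

```python
def judge_prompt(description: str, expected: list, not_expected: list) -> str:
    """Build a grading prompt asking an LLM judge to score one region description."""
    pos = "\n".join(f"- {p}" for p in expected)
    neg = "\n".join(f"- {n}" for n in not_expected)
    return (
        "You are grading a description of a specific image region.\n\n"
        f"Description:\n{description}\n\n"
        f"Award +1 for each attribute below that it correctly mentions:\n{pos}\n\n"
        f"Deduct 1 for each incorrect claim below that it makes:\n{neg}\n\n"
        "Reply with the total integer score only."
    )
```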
📈 Results? Our method outperforms API-only, open-source, and region-specific VLMs across detailed localized captioning tasks.