
SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)

Author: Latent Space

Uploaded: 2025-12-18

Views: 1919

Description:

As with all demo-heavy (and especially vision-AI) podcasts, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!).
From SAM 1's 11-million-image data engine to SAM 2's memory-based video tracking, MSL's Segment Anything project has redefined what's possible in computer vision. Now SAM 3 takes the next leap: *concept segmentation*—prompting with natural language like "yellow school bus" or "tablecloth" to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest SAM Audio (https://x.com/aiatmeta/status/2000980...), SAM can now even segment audio output!
We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher), alongside Joseph Nelson (CEO, Roboflow), to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30 ms on images and scales to real-time video on multi-GPU setups. We dig into the data engine that automated exhaustive annotation, cutting it from two minutes per image to 25 seconds using AI verifiers fine-tuned on Llama; the new SACO (Segment Anything with Concepts) benchmark with 200,000+ unique concepts versus the previous 1.2k; how SAM 3 separates recognition from localization with a presence token; why decoupling the detector and tracker was critical to preserving object identity in video; how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini; and the real-world impact: 106 million smart polygons created on Roboflow, saving humanity an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception.
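The unified workflow described above—a text concept prompt yielding per-instance masks with stable identities across video frames, with an identity-agnostic detector decoupled from an identity-preserving tracker—can be sketched with stand-in stubs. Every class and method name below is a hypothetical illustration of the data flow, not the released SAM 3 API:

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """One detected object: a stable identity plus a (stub) mask per frame."""
    obj_id: int
    masks: dict = field(default_factory=dict)  # frame index -> mask/box

class ConceptSegmenter:
    """Stand-in for a concept-promptable segmenter/tracker.

    SAM 3 (per the episode) decouples an identity-agnostic detector
    from an identity-preserving tracker; both are trivial stubs here.
    """
    def __init__(self):
        self._tracks: dict[int, Instance] = {}

    def detect(self, frame, concept: str) -> list:
        # Detector: finds all instances of `concept` in one frame,
        # with no notion of identity (stubbed as two fixed boxes).
        return [(10, 10, 50, 50), (60, 60, 90, 90)]

    def track(self, frame_idx: int, detections: list) -> list[Instance]:
        # Tracker: assigns and maintains identities across frames.
        # Stub policy: match detections to tracks by list position.
        for i, det in enumerate(detections):
            if i not in self._tracks:
                self._tracks[i] = Instance(obj_id=i)
            self._tracks[i].masks[frame_idx] = det
        return list(self._tracks.values())

# Prompt once with a concept, then run over a short "video".
seg = ConceptSegmenter()
for t in range(3):
    dets = seg.detect(frame=None, concept="yellow school bus")
    instances = seg.track(t, dets)

# Each instance keeps one identity across all three frames.
assert len(instances) == 2
assert sorted(instances[0].masks) == [0, 1, 2]
```

The point of the split is visible even in the stub: `detect` may be run per frame with no memory, while `track` alone owns identity, so improving one does not disturb the other.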
We discuss:

  • What SAM 3 is: a unified model for concept-prompted segmentation, detection, and tracking in images and video using atomic visual concepts like "purple umbrella" or "watering can"
  • How concept prompts work: short text phrases that find all instances of a category without manual clicks, plus visual exemplars (boxes, clicks) to refine and adapt on the fly
  • Real-time performance: 30 ms per image (100 detected objects on H200), 10 objects on 2×H200 video, 28 on 4×, 64 on 8×, with parallel inference and "fast mode" tracking
  • The SACO benchmark: 200,000+ unique concepts vs. 1.2k in prior benchmarks, designed to capture the diversity of natural language and reach human-level exhaustivity
  • The data engine: from 2 minutes per image (all-human) to 45 seconds (model-in-loop proposals) to 25 seconds (AI verifiers for mask quality and exhaustivity checks), fine-tuned on Llama 3.2
  • Why exhaustivity is central: every instance must be found, verified by AI annotators, and manually corrected only when the model misses—automating the hardest part of segmentation at scale
  • Architecture innovations: a presence token to separate recognition ("is it in the image?") from localization ("where is it?"), and a decoupled detector and tracker to preserve identity-agnostic detection vs. identity-preserving tracking
  • Building on Meta's ecosystem: Perception Encoder, DINO v2 detector, Llama for data annotation, and SAM 2's memory-based tracking backbone
  • SAM 3 Agents: using SAM 3 as a visual tool for multimodal LLMs (Gemini, Llama) to solve complex visual reasoning tasks like "find the bigger character" or "what distinguishes male from female in this image"
  • Fine-tuning with as few as 10 examples: domain adaptation for specialized use cases (Waymo vehicles, medical imaging, OCR-heavy scenes) and the outsized impact of negative examples
  • Real-world impact at Roboflow: 106M smart polygons created, saving 130+ years of labeling time across cancer research, underwater trash cleanup, autonomous drones, industrial automation, and more

—
MSL FAIR team

Nikhila:
Pengchuan: https://pzzhang.github.io/pzzhang/

Joseph Nelson

X:
LinkedIn:


