Veo 3 + DeepSeek V3.2-Exp Explained: Sparse Attention for Affordable Long Contexts
Author: Latent Space TV (see @LatentSpacePod for the podcast)
Uploaded: 2025-10-02
Views: 119
Two main topics: an analysis of Google's Veo 3 video model and DeepSeek's new experimental V3.2 model (V3.2-Exp). The first part of the discussion centers on a paper analyzing Veo 3, arguing that video models are emerging as *general-purpose foundation models* in the same way large language models (LLMs) did for text. The paper explores Veo 3's emergent zero-shot capabilities across perception, modeling, manipulation, and early forms of visual reasoning. However, the speaker is skeptical about the extent of its "reasoning" capabilities, suggesting that an underlying LLM prompt rewriter may account for much of the perceived reasoning. The second part introduces DeepSeek V3.2-Exp, which uses *sparse attention* to sharply reduce inference cost in long-context scenarios, making the model much cheaper to serve at longer input sequence lengths without a substantial drop in performance.
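For intuition on the cost claim (illustrative numbers, not from the episode): if each query token attends to a fixed top-k set of k past tokens instead of all L preceding tokens, the core attention term scales as O(L·k) rather than O(L²). At L = 128K and k = 2K that term shrinks by roughly 64x, which is where the long-context savings come from; only the much cheaper indexer still scores every position.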
Timestamps
00:00 Introduction to the Veo 3 analysis paper and initial skepticism
00:58 Premise that Veo 3 can act as an LLM for video, performing general reasoning and tasks
02:09 Comparison to Sora 1 and Sora 2, and a discussion on world models
03:23 The paper's claim of Veo 3's reasoning capabilities, including maze solving, and the potential influence of an LLM prompt rewriter
04:36 Discussion on general-purpose vision understanding through large-scale training
06:05 Demonstration of Veo 3's capabilities on a web page, covering perception, modeling, manipulation, and reasoning
07:52 Skepticism regarding the true source of "reasoning" (LLM vs. video model)
09:28 Quantitative results and comparison to other models, including the use of green backgrounds for better performance
11:11 How Google attempts to isolate the video model's reasoning capabilities from the LLM rewriter
11:46 Overview of the four hierarchical capabilities: perception, modeling, manipulation, and reasoning
13:36 Detailed look at perception tasks (edge detection, segmentation) and the claim that video models will replace bespoke CV models
15:06 Discussion on modeling physical properties and optical phenomena, and manipulation tasks like background removal
15:41 Visual reasoning and the concept of "chain of frames" as analogous to "chain of thought"
17:09 Quantitative tasks, performance metrics, and comparison to Veo 2 and Nano Banana
18:01 Detailed analysis of specific quantitative tasks like edge detection, object extraction, and segmentation, highlighting the green background bias
20:26 Discussion on maze solving, image editing, and visual symmetry solving
21:57 Discussion on Veo 3's emergent zero-shot abilities and its role as a foundation model for machine vision
23:06 Framing the paper's outlook and the benefits of general capabilities
24:12 Recap of the Veo 3 paper as an analysis rather than a technical detail paper
25:19 Discussion on the speaker's skepticism about the paper's claims and the importance of capability exploration
25:40 Question about dollar-per-token cost and the comparison between specialized models and foundation models
26:10 Question on what "true reasoning" would look like in video models without an LLM rewriter
27:52 Discussion on Sora 2's system card and the absence of quantitative metrics
28:36 OpenAI's approach to system cards and safety measures for Sora 2, including moderation classifiers and output blocking
30:37 Transparency, watermarking, and internal detection tools for AI-generated content
32:02 Discussion on control nets and the limitations of API-based models
33:51 Follow-up on the dollar cost question and the future of specialized vs. generalist models
36:09 Question on how text prompts are integrated into the latent space of vision models
38:19 Introduction to the DeepSeek V3.2-Exp paper
38:52 DeepSeek V3.2-Exp's focus on *reducing inference cost* for long contexts using sparse attention
39:54 Explanation of the *sparse attention mechanism* with a lightning indexer and fine-grained token selection (see the sketch after the timestamps)
41:02 How the indexer computes an index score to select top-k tokens, and the role of its smaller size for computational efficiency
47:08 Two stages of pre-training: dense warm-up (for indexer initialization) and sparse attention mechanism training
48:51 Post-training with two modifications: *specialist distillation* and *mixed RL training*
49:23 Specialist distillation using expert models in mathematics, competitive programming, logical reasoning, agentic coding, and agentic search
50:52 Mixed RL training to balance performance across tasks and prevent catastrophic forgetting
52:22 Evaluations showing the efficiency of the sparse attention implementation and the resulting cost reductions
54:10 The potential for other models to adopt this sparse attention technique, since it can be added through continued training
54:49 Discussion on slight performance drops in some benchmarks for DeepSeek V3.2-Exp, weighed against significant cost savings
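
To make the lightning-indexer idea at 39:54–41:02 concrete, here is a minimal single-head PyTorch sketch of indexer-guided sparse attention. It is an illustration, not DeepSeek's implementation: the function name `sparse_attention_with_indexer` and the parameters `d_idx` and `top_k` are invented for this example, and the real model uses learned per-head indexer weights, a specific top-k budget, and efficient kernels rather than dense masks.

```python
import torch
import torch.nn.functional as F

def sparse_attention_with_indexer(q, k, v, q_idx, k_idx, top_k):
    """Toy single-head sparse attention guided by a small indexer.

    q, k, v:      [seq, d_model]  full projections used by real attention
    q_idx, k_idx: [seq, d_idx]    small projections used only for scoring
                                  (d_idx << d_model keeps the indexer cheap)
    top_k:        max number of past tokens each query may attend to
    """
    seq, d_model = q.shape

    # 1. Lightning-indexer scores: cheap query/key similarity computed in the
    #    small indexer dimension, restricted to causal (past) positions.
    scores = q_idx @ k_idx.T                               # [seq, seq]
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))

    # 2. Fine-grained token selection: keep only the top-k scoring past
    #    tokens for each query; everything else is dropped from attention.
    top_idx = scores.topk(min(top_k, seq), dim=-1).indices  # [seq, k_eff]
    keep = torch.zeros(seq, seq, dtype=torch.bool)
    keep[torch.arange(seq).unsqueeze(1), top_idx] = True
    keep &= causal                                          # never peek ahead

    # 3. Ordinary attention, but only over the selected tokens.
    attn = (q @ k.T) / d_model ** 0.5
    attn = attn.masked_fill(~keep, float("-inf"))
    return F.softmax(attn, dim=-1) @ v

# Toy usage: 8 tokens, each query limited to its 3 best-scoring past tokens.
seq, d_model, d_idx = 8, 16, 4
out = sparse_attention_with_indexer(
    torch.randn(seq, d_model), torch.randn(seq, d_model), torch.randn(seq, d_model),
    torch.randn(seq, d_idx), torch.randn(seq, d_idx), top_k=3,
)
print(out.shape)  # torch.Size([8, 16])
```

The essential trade is visible even in this toy: the full L x L softmax is replaced by attention over at most `top_k` tokens per query, while the indexer's own scoring stays cheap because it operates in a much smaller dimension.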
