Scalable Inference Algorithms for Large Language Models | Woomin Song, KAIST | AER LABS
Author: AER Labs
Uploaded: 2026-01-08
Scalable Inference Algorithms for LLMs: REFORM & STAND
In this presentation, Woomin Song introduces two training-free frameworks for efficient LLM inference: REFORM for long-context processing and STAND for accelerating test-time scaling.
Part 1: REFORM (NeurIPS 2025)
Learn how REFORM addresses the quadratic computational cost of Transformer attention and the memory bottleneck of the KV cache. By combining recurrent chunking with on-demand cache recomputation, REFORM reaches 75% accuracy on a 1M-token Needle-In-A-Haystack benchmark while significantly reducing inference latency and peak GPU memory usage.
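A minimal Python sketch of the compress → gather → recompute pipeline described above (walked through at [06:28]–[09:16]); the function names, the `encode_fn`/`generate_fn` callables, and the `cache_budget`/`top_k` parameters are illustrative assumptions, not REFORM's actual implementation.

```python
import numpy as np

def compress(chunks, encode_fn, cache_budget):
    """Recurrent chunking: encode each chunk conditioned on the running
    compressed cache, then evict the lowest-attention tokens so the cache
    never exceeds cache_budget entries (token eviction by attention score)."""
    cache_keys, cache_meta = [], []          # per-token key vectors + (chunk_id, pos, score)
    for chunk_id, chunk in enumerate(chunks):
        # encode_fn returns one key vector and one attention score per token
        keys, attn_scores = encode_fn(chunk, cache_keys)
        cache_keys.extend(keys)
        cache_meta.extend((chunk_id, pos, s) for pos, s in enumerate(attn_scores))
        if len(cache_keys) > cache_budget:
            keep = np.argsort([m[2] for m in cache_meta])[::-1][:cache_budget]
            cache_keys = [cache_keys[i] for i in keep]
            cache_meta = [cache_meta[i] for i in keep]
    return np.stack(cache_keys), cache_meta

def gather(query_vec, cache_keys, cache_meta, top_k):
    """Cosine-similarity search over the compressed cache to find the
    tokens most relevant to the query."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    k = cache_keys / (np.linalg.norm(cache_keys, axis=1, keepdims=True) + 1e-8)
    top = np.argsort(k @ q)[::-1][:top_k]
    return [cache_meta[i][:2] for i in top]  # (chunk_id, position) of selected tokens

def recompute(gathered_tokens, question, generate_fn):
    """On-demand recomputation: run a fresh forward pass over only the
    gathered tokens plus the question, then generate the answer."""
    return generate_fn(gathered_tokens, question)
```

In this sketch, a driver loop would map the selected (chunk_id, position) pairs back to the original token IDs before calling recompute; the point is only to show how the three stages fit together.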
Part 2: STAND (EMNLP 2025)
Discover how STAND accelerates test-time scaling (chain-of-thought reasoning, majority voting, tree search) with model-free speculative decoding. By exploiting the high n-gram overlap across reasoning trajectories and using stochastic drafting, STAND matches the accuracy of standard decoding while using less than 40% of the decoding time.
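As a rough illustration of the drafting side, here is a minimal Python sketch of a cross-trajectory n-gram table with stochastic (sampled) drafting; the class name `NGramDrafter`, the fixed `n`, and the raw count-based distribution are simplified stand-ins for STAND's probability-aware n-gram drafter and optimized tree drafting.

```python
import random
from collections import Counter, defaultdict

class NGramDrafter:
    """Model-free drafter: accumulates (context -> next-token) counts across
    all reasoning trajectories generated so far and samples drafts from them."""

    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(Counter)    # n-token context -> next-token counts

    def update(self, token_ids):
        """Fold one (partial or finished) trajectory into the shared table,
        so later trajectories can reuse its n-grams."""
        for i in range(len(token_ids) - self.n):
            ctx = tuple(token_ids[i:i + self.n])
            self.table[ctx][token_ids[i + self.n]] += 1

    def draft(self, prefix, max_len):
        """Stochastic drafting: sample each continuation token from the
        empirical next-token distribution instead of always taking the
        single most frequent continuation."""
        out = list(prefix)
        for _ in range(max_len):
            counts = self.table.get(tuple(out[-self.n:]))
            if not counts:
                break                        # unseen context: stop drafting
            tokens, weights = zip(*counts.items())
            out.append(random.choices(tokens, weights=weights, k=1)[0])
        return out[len(prefix):]             # proposed draft tokens
```

In speculative decoding, the target model then verifies each draft in a single forward pass and keeps only the prefix it would have generated itself, which is why the speedup comes without any change in output accuracy (see the Q&A at [27:28]).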
Both works were conducted during the speaker's internship at Amazon.
Speaker: Woomin Song | Integrated M.S. + Ph.D. Student at KAIST
Affiliation: KAIST (Korea Advanced Institute of Science and Technology)
[Resume & Profile]
https://woominsong.github.io/
---
Timestamps:
[Part 1: REFORM - Long Context Processing]
[00:00] Introduction: Scalable Inference Algorithms for LLMs
[00:42] The Problem: Quadratic computational costs and KV cache bottlenecks
[01:52] The Challenge: Pre-trained context length limits
[02:18] Existing Solutions: Recurrent Compression (StreamingLLM, H2O)
[03:36] Existing Solutions: Random Access approaches and their limitations
[04:28] Introducing REFORM: Best of both worlds
[05:08] Key Observation: Attention heads as token selectors using cosine similarity
[05:52] Methodology Overview: Compress, Gather, and Recompute stages
[06:28] Step 1: Compress - Recurrent chunking with early exit strategy
[08:12] Handling KV Cache: Token eviction using attention scores
[08:52] Step 2: Gather - Cosine similarity search for relevant tokens
[09:16] Step 3: Recompute - Forwarding gathered inputs for generation
[09:32] Evaluation: Needle-In-A-Haystack (NIAH) benchmark results
[10:24] Synthetic Benchmarks: Comparison with InfLLM (23% vs 75% at 1M tokens)
[10:52] Realistic Benchmarks: InfiniteBench, RepoEval, and MM-NIAH results
[11:28] Efficiency Analysis: Inference time and peak GPU memory savings
[12:16] Comparison with RAG: Architecture-level advantages
[13:24] Ablation Studies: Compression strategies and head selection
[Part 2: STAND - Test-Time Scaling Acceleration]
[14:08] Introduction: Test-time scaling and the latency problem
[15:12] Background: Chain-of-thought, majority voting, and tree search
[16:32] The Research Problem: Speeding up without compromising accuracy
[17:04] Speculative Decoding: Draft-then-verify framework
[18:16] Key Observation: High n-gram overlap across reasoning trajectories
[19:08] Model-Free Drafters: Leveraging cross-trajectory information
[20:04] Stochastic vs Deterministic Drafting: Why sampling matters
[21:16] STAND Components: N-gram drafter with probability awareness
[22:08] Optimization Techniques: Gumbel top-k trick for faster sampling (see the sketch after the timestamps)
[22:32] Tree Drafting: Optimizing tree structure for higher acceptance
[23:16] Evaluation: AIME 2024, GPQA Diamond, and LiveCodeBench results
[24:28] Results: Same accuracy in under 40% decoding time
[25:04] Batch Decoding Scenarios: STAND remains effective in parallel inference
[25:32] Ablation Studies: Contribution of stochastic drafting and tree optimization
[26:24] Key Finding: Deeper and narrower tree structures perform better
[26:52] Summary: N-gram based speculative decoding for test-time scaling
[Q&A Session]
[27:28] Q&A: How speculative decoding ensures output correctness
[31:04] Q&A: Greedy decoding vs sampling scenarios
[33:28] Q&A: Tree drafting explanation and benefits
[38:24] Q&A: Batch decoding and high-throughput inference scenarios
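The Gumbel top-k trick mentioned at [22:08] is a standard sampling technique; the snippet below is a self-contained NumPy illustration of the trick itself (toy distribution, hypothetical function name), not of how STAND wires it into its drafter.

```python
import numpy as np

def gumbel_top_k(logits, k, rng=None):
    """Draw k distinct indices without replacement from softmax(logits):
    perturb each logit with i.i.d. Gumbel(0, 1) noise and keep the top-k.
    One vectorized perturb-and-sort pass replaces k sequential sampling steps."""
    rng = np.random.default_rng() if rng is None else rng
    perturbed = logits + rng.gumbel(size=len(logits))
    return np.argsort(perturbed)[::-1][:k]

# Example: pick 4 distinct draft-branch tokens from a toy distribution.
probs = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
branches = gumbel_top_k(np.log(probs), k=4)
```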
---
Hosted by AER Labs
#REFORM #STAND #KAIST #LLM #LongContext #SpeculativeDecoding #TestTimeScaling #DeepLearning #Transformer #Inference #AIResearch #NLP #MachineLearning #NeurIPS2025 #EMNLP2025