Scalable Inference Algorithms for Large Language Models | Woomin Song, KAIST | AER LABS
Author: AER Labs
Uploaded: 2026-01-08
Scalable Inference Algorithms for LLMs: REFORM & STAND
In this presentation, Woomin Song introduces two training-free frameworks for efficient LLM inference: REFORM for long-context processing and STAND for accelerating test-time scaling.
Part 1: REFORM (NeurIPS 2025)
Learn how REFORM addresses the quadratic computational cost of Transformer attention and the memory bottleneck of the KV cache. By combining recurrent chunking with on-demand cache recomputation, REFORM reaches 75% accuracy on a 1M-token Needle-In-A-Haystack benchmark while significantly reducing inference latency and peak GPU memory usage.
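A minimal Python sketch of the compress → gather → recompute pipeline described above (walked through at [06:28]–[09:16]); the function names, the `encode_fn`/`generate_fn` callables, and the `cache_budget`/`top_k` parameters are illustrative assumptions, not REFORM's actual implementation.

```python
import numpy as np

def compress(chunks, encode_fn, cache_budget):
    """Recurrent chunking: encode each chunk conditioned on the running
    compressed cache, then evict the lowest-attention tokens so the cache
    never exceeds cache_budget entries (token eviction by attention score)."""
    cache_keys, cache_meta = [], []          # per-token key vectors + (chunk_id, pos, score)
    for chunk_id, chunk in enumerate(chunks):
        # encode_fn returns one key vector and one attention score per token
        keys, attn_scores = encode_fn(chunk, cache_keys)
        cache_keys.extend(keys)
        cache_meta.extend((chunk_id, pos, s) for pos, s in enumerate(attn_scores))
        if len(cache_keys) > cache_budget:
            keep = np.argsort([m[2] for m in cache_meta])[::-1][:cache_budget]
            cache_keys = [cache_keys[i] for i in keep]
            cache_meta = [cache_meta[i] for i in keep]
    return np.stack(cache_keys), cache_meta

def gather(query_vec, cache_keys, cache_meta, top_k):
    """Cosine-similarity search over the compressed cache to find the
    tokens most relevant to the query."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    k = cache_keys / (np.linalg.norm(cache_keys, axis=1, keepdims=True) + 1e-8)
    top = np.argsort(k @ q)[::-1][:top_k]
    return [cache_meta[i][:2] for i in top]  # (chunk_id, position) of selected tokens

def recompute(gathered_tokens, question, generate_fn):
    """On-demand recomputation: run a fresh forward pass over only the
    gathered tokens plus the question, then generate the answer."""
    return generate_fn(gathered_tokens, question)
```

In this sketch, a driver loop would map the selected (chunk_id, position) pairs back to the original token IDs before calling recompute; the point is only to show how the three stages fit together.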
Part 2: STAND (EMNLP 2025)
Discover how STAND accelerates test-time scaling (chain-of-thought reasoning, majority voting, tree search) with model-free speculative decoding. By exploiting the high n-gram overlap across reasoning trajectories and using stochastic drafting, STAND matches the accuracy of standard decoding while using less than 40% of the decoding time.
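As a rough illustration of the drafting side, here is a minimal Python sketch of a cross-trajectory n-gram table with stochastic (sampled) drafting; the class name `NGramDrafter`, the fixed `n`, and the raw count-based distribution are simplified stand-ins for STAND's probability-aware n-gram drafter and optimized tree drafting.

```python
import random
from collections import Counter, defaultdict

class NGramDrafter:
    """Model-free drafter: accumulates (context -> next-token) counts across
    all reasoning trajectories generated so far and samples drafts from them."""

    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(Counter)    # n-token context -> next-token counts

    def update(self, token_ids):
        """Fold one (partial or finished) trajectory into the shared table,
        so later trajectories can reuse its n-grams."""
        for i in range(len(token_ids) - self.n):
            ctx = tuple(token_ids[i:i + self.n])
            self.table[ctx][token_ids[i + self.n]] += 1

    def draft(self, prefix, max_len):
        """Stochastic drafting: sample each continuation token from the
        empirical next-token distribution instead of always taking the
        single most frequent continuation."""
        out = list(prefix)
        for _ in range(max_len):
            counts = self.table.get(tuple(out[-self.n:]))
            if not counts:
                break                        # unseen context: stop drafting
            tokens, weights = zip(*counts.items())
            out.append(random.choices(tokens, weights=weights, k=1)[0])
        return out[len(prefix):]             # proposed draft tokens
```

In speculative decoding, the target model then verifies each draft in a single forward pass and keeps only the prefix it would have generated itself, which is why the speedup comes without any change in output accuracy (see the Q&A at [27:28]).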
Both works were conducted during the speaker's internship at Amazon.
Speaker: Woomin Song | Integrated M.S. + Ph.D. Student at KAIST
Affiliation: KAIST (Korea Advanced Institute of Science and Technology)
[Resume & Profile]
https://woominsong.github.io/
---
Timestamps:
[Part 1: REFORM - Long Context Processing]
[00:00] Introduction: Scalable Inference Algorithms for LLMs
[00:42] The Problem: Quadratic computational costs and KV cache bottlenecks
[01:52] The Challenge: Pre-trained context length limits
[02:18] Existing Solutions: Recurrent Compression (StreamingLLM, H2O)
[03:36] Existing Solutions: Random Access approaches and their limitations
[04:28] Introducing REFORM: Best of both worlds
[05:08] Key Observation: Attention heads as token selectors using cosine similarity
[05:52] Methodology Overview: Compress, Gather, and Recompute stages
[06:28] Step 1: Compress - Recurrent chunking with early exit strategy
[08:12] Handling KV Cache: Token eviction using attention scores
[08:52] Step 2: Gather - Cosine similarity search for relevant tokens
[09:16] Step 3: Recompute - Forwarding gathered inputs for generation
[09:32] Evaluation: Needle-In-A-Haystack (NIAH) benchmark results
[10:24] Synthetic Benchmarks: Comparison with InfLLM (23% vs 75% at 1M tokens)
[10:52] Realistic Benchmarks: InfiniteBench, RepoEval, and MM-NIAH results
[11:28] Efficiency Analysis: Inference time and peak GPU memory savings
[12:16] Comparison with RAG: Architecture-level advantages
[13:24] Ablation Studies: Compression strategies and head selection
[Part 2: STAND - Test-Time Scaling Acceleration]
[14:08] Introduction: Test-time scaling and the latency problem
[15:12] Background: Chain-of-thought, majority voting, and tree search
[16:32] The Research Problem: Speeding up without compromising accuracy
[17:04] Speculative Decoding: Draft-then-verify framework
[18:16] Key Observation: High n-gram overlap across reasoning trajectories
[19:08] Model-Free Drafters: Leveraging cross-trajectory information
[20:04] Stochastic vs Deterministic Drafting: Why sampling matters
[21:16] STAND Components: N-gram drafter with probability awareness
[22:08] Optimization Techniques: Gumbel top-k trick for faster sampling (see the sketch after the timestamps)
[22:32] Tree Drafting: Optimizing tree structure for higher acceptance
[23:16] Evaluation: AIME 2024, GPQA Diamond, and LiveCodeBench results
[24:28] Results: Same accuracy in under 40% decoding time
[25:04] Batch Decoding Scenarios: STAND remains effective in parallel inference
[25:32] Ablation Studies: Contribution of stochastic drafting and tree optimization
[26:24] Key Finding: Deeper and narrower tree structures perform better
[26:52] Summary: N-gram based speculative decoding for test-time scaling
[Q&A Session]
[27:28] Q&A: How speculative decoding ensures output correctness
[31:04] Q&A: Greedy decoding vs sampling scenarios
[33:28] Q&A: Tree drafting explanation and benefits
[38:24] Q&A: Batch decoding and high-throughput inference scenarios
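The Gumbel top-k trick mentioned at [22:08] is a standard sampling technique; the snippet below is a self-contained NumPy illustration of the trick itself (toy distribution, hypothetical function name), not of how STAND wires it into its drafter.

```python
import numpy as np

def gumbel_top_k(logits, k, rng=None):
    """Draw k distinct indices without replacement from softmax(logits):
    perturb each logit with i.i.d. Gumbel(0, 1) noise and keep the top-k.
    One vectorized perturb-and-sort pass replaces k sequential sampling steps."""
    rng = np.random.default_rng() if rng is None else rng
    perturbed = logits + rng.gumbel(size=len(logits))
    return np.argsort(perturbed)[::-1][:k]

# Example: pick 4 distinct draft-branch tokens from a toy distribution.
probs = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
branches = gumbel_top_k(np.log(probs), k=4)
```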
---
Hosted by AER Labs
#REFORM #STAND #KAIST #LLM #LongContext #SpeculativeDecoding #TestTimeScaling #DeepLearning #Transformer #Inference #AIResearch #NLP #MachineLearning #NeurIPS2025 #EMNLP2025