LLaDA-VLA: Vision Language Diffusion Action Models (Wen et al., arXiv 2509)
Author: AIDAS Lab
Uploaded: 2025-10-21
Views: 223
Recent advances in Vision-Language-Action (VLA) models have shown strong performance in robotic control, typically using vision-language models (VLMs) as backbones and diffusion policies for robot action generation. Building on this progress, LLaDA-VLA is the first VLA framework built on a masked diffusion model (MDM) rather than a traditional autoregressive architecture. It uses a masked diffusion process to predict and iteratively refine action tokens in parallel, and introduces two key innovations: Localized Special-Token Classification, which focuses learning on the discrete action tokens, and Hierarchical Action-Structured Decoding, which ensures coherent multi-step trajectory generation. Based on LLaDA-V and a SigLIP-2 vision encoder, the model translates text and image inputs into 7-DoF robot actions. Experiments on SimplerEnv, CALVIN, and WidowX robots show substantial gains over previous VLAs, establishing diffusion-based language models as a new paradigm for robotic manipulation.
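To make the parallel refinement idea concrete, below is a minimal sketch of confidence-based masked-diffusion decoding over one chunk of discrete action tokens. All names and values (predict_logits, MASK_ID, ACTION_LEN, VOCAB_SIZE, NUM_STEPS, the unmasking schedule) are illustrative assumptions, not the paper's actual code or API.

# Minimal sketch of parallel masked-diffusion action decoding, assuming a
# discrete action vocabulary and confidence-based progressive unmasking.
# All constants and function names below are hypothetical stand-ins,
# not the identifiers used by LLaDA-VLA.
import torch

VOCAB_SIZE = 256        # size of the discrete action-token vocabulary (assumed)
MASK_ID = VOCAB_SIZE    # id of the [MASK] placeholder, outside the vocabulary
ACTION_LEN = 7          # one 7-DoF action: x, y, z, roll, pitch, yaw, gripper
NUM_STEPS = 4           # number of iterative refinement steps (assumed)

def predict_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the MDM backbone conditioned on image + text:
    returns per-position logits over the action vocabulary."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB_SIZE)

def decode_actions(batch_size: int = 1) -> torch.Tensor:
    # Start with every action position masked, then fill positions in
    # parallel, committing only the most confident predictions each step.
    tokens = torch.full((batch_size, ACTION_LEN), MASK_ID, dtype=torch.long)
    for step in range(NUM_STEPS):
        probs = predict_logits(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                 # confidence + argmax per slot
        still_masked = tokens == MASK_ID
        conf = conf.masked_fill(~still_masked, -1.0)   # never re-select filled slots
        # Unmask a growing fraction of positions as decoding progresses.
        k = max(1, ACTION_LEN * (step + 1) // NUM_STEPS)
        topk = conf.topk(k, dim=-1).indices
        keep = torch.zeros_like(still_masked)
        keep.scatter_(1, topk, torch.ones_like(topk, dtype=torch.bool))
        tokens = torch.where(keep & still_masked, pred, tokens)
    return tokens  # fully decoded discrete action tokens for one chunk

if __name__ == "__main__":
    print(decode_actions())

The paper's Hierarchical Action-Structured Decoding additionally constrains how positions are committed across the action structure so that multi-step trajectories stay coherent; that scheduling is omitted from this sketch for brevity.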
Presenter: Hoeun Lee