We propose the first Vision-Language-Diffusion-Action model built upon a pretrained diffusion Vision-Language Model, achieving SOTA performance in both simulation and real-world settings and demonstrating strong generalization to unseen tasks.
Overview of LLaDA-VLA. (a) Overall architecture. Visual features extracted by the vision encoder are projected into the text space and concatenated with text tokens. Together with masked tokens, they are fed into a large language diffusion model, which generates action sequences via Localized Special-Token Classification; the sequences are further refined with Hierarchical Action-Structured Decoding. (b) Hierarchical Action-Structured Decoding strategy. Starting from a fully masked action sequence (except the vision and text prompts), the model iteratively predicts masked tokens, performing action-level and token-level remasking based on confidence until the full sequence is decoded.
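To make the decoding loop concrete, below is a minimal NumPy sketch of how such confidence-based action-level and token-level remasking could be implemented. It is an illustration under stated assumptions, not the released implementation: the names (`hierarchical_decode`, `predict_fn`, `MASK_ID`), the linear keep schedule, and the chunk sizes are all hypothetical.

```python
import numpy as np

MASK_ID = -1  # placeholder id for a masked action token (illustrative assumption)


def hierarchical_decode(predict_fn, num_actions, tokens_per_action, num_steps=8):
    """Minimal sketch of Hierarchical Action-Structured Decoding.

    predict_fn(seq) -> (pred_ids, conf): a stand-in for the diffusion VLM call
    that returns a predicted token id and a confidence score for every position.
    The action sequence is treated as `num_actions` chunks of `tokens_per_action`
    tokens; at each step the least-confident whole actions are remasked
    (action level), then the least-confident tokens inside the kept actions are
    remasked individually (token level), until everything is decoded.
    """
    length = num_actions * tokens_per_action
    seq = np.full(length, MASK_ID, dtype=int)

    for step in range(num_steps):
        pred_ids, conf = predict_fn(seq)
        masked = seq == MASK_ID
        seq = np.where(masked, pred_ids, seq)   # fill every masked position
        conf = np.where(masked, conf, 1.0)      # previously decoded tokens are kept

        if step == num_steps - 1:
            break  # final step: accept all remaining predictions

        # keep a growing fraction of the sequence as decoding progresses
        keep_ratio = (step + 1) / num_steps

        seq = seq.reshape(num_actions, tokens_per_action)
        conf = conf.reshape(num_actions, tokens_per_action)

        # action-level remasking: remask the least-confident whole actions
        action_conf = conf.mean(axis=1)
        n_keep_actions = max(1, int(round(keep_ratio * num_actions)))
        weak_actions = np.argsort(action_conf)[: num_actions - n_keep_actions]
        seq[weak_actions] = MASK_ID

        # token-level remasking inside the actions that were kept
        kept_actions = np.setdiff1d(np.arange(num_actions), weak_actions)
        n_keep_tokens = max(1, int(round(keep_ratio * tokens_per_action)))
        for a in kept_actions:
            weak_tokens = np.argsort(conf[a])[: tokens_per_action - n_keep_tokens]
            seq[a, weak_tokens] = MASK_ID

        seq = seq.reshape(-1)

    return seq


# Toy usage with a random stand-in for the model (for illustration only).
def dummy_predict(seq):
    rng = np.random.default_rng(0)
    return rng.integers(0, 256, size=seq.shape), rng.random(seq.shape)


if __name__ == "__main__":
    print(hierarchical_decode(dummy_predict, num_actions=4, tokens_per_action=7))
```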
Performance under varying data scales. LLaDA-VLA consistently outperforms previous methods on both the SimplerEnv and CALVIN benchmarks. In real-robot experiments, LLaDA-VLA achieves better performance than both CogACT and π₀.
Performance on unseen tasks. LLaDA-VLA exhibits strong generalization ability in real-robot OOD tasks, handling unseen objects, containers, and distractors, and achieving notably higher success rates than π₀.
@misc{wen2025lladavlavisionlanguagediffusion,
  title={LLaDA-VLA: Vision Language Diffusion Action Models},
  author={Yuqing Wen and Hebei Li and Kefan Gu and Yucheng Zhao and Tiancai Wang and Xiaoyan Sun},
  year={2025},
  eprint={2509.06932},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.06932},
}
Feel free to contact us at wenyuqing@mail.ustc.edu.cn or wangtiancai@megvii.com.