We propose the first Vision-Language-Diffusion-Action model built upon a pretrained diffusion-based Vision-Language Model. It achieves state-of-the-art performance in both simulation and real-world settings and demonstrates strong generalization to unseen tasks.
Overview of LLaDA-VLA. (a) Overall architecture. Visual features extracted by the vision encoder are projected into the text embedding space and concatenated with the text tokens. Together with masked tokens, they are fed into a large language diffusion model, which generates action sequences via Localized Special-Token Classification and refines them with Hierarchical Action-Structured Decoding. (b) Hierarchical Action-Structured Decoding. Starting from a fully masked action sequence (with the vision and text prompts left unmasked), the model iteratively predicts the masked tokens, remasking at the action level and the token level based on confidence until the full sequence is decoded.
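The decoding loop in (b) can be sketched in code. This is a minimal illustration under stated assumptions, not the released implementation: the HF-style model interface (model(seq).logits), the fixed step schedule, the hierarchical_decode / num_steps / tokens_per_action names, and the half-remask heuristic are all hypothetical; only the overall pattern (start from fully masked actions, predict, then remask low-confidence actions and their low-confidence tokens) follows the caption.

import torch

@torch.no_grad()
def hierarchical_decode(model, prompt_ids, num_actions, tokens_per_action,
                        mask_id, num_steps=8):
    """Sketch of confidence-based hierarchical decoding (assumed interface)."""
    device = prompt_ids.device
    action_len = num_actions * tokens_per_action
    # Start from a fully masked action sequence; the prompt stays unmasked.
    actions = torch.full((1, action_len), mask_id, dtype=torch.long, device=device)

    for step in range(num_steps):
        seq = torch.cat([prompt_ids, actions], dim=1)
        # Assumption: the model returns per-position logits over the vocabulary.
        logits = model(seq).logits[:, -action_len:]
        conf, pred = logits.softmax(-1).max(-1)          # per-token confidence

        # Commit predictions only at positions that are still masked.
        masked = actions == mask_id
        actions = torch.where(masked, pred, actions)
        conf = torch.where(masked, conf, torch.ones_like(conf))

        if step == num_steps - 1:
            break  # last step: keep the fully decoded sequence

        # Action-level remasking: pick the least-confident actions (schedule is illustrative).
        act_conf = conf.view(1, num_actions, tokens_per_action).mean(-1)
        n_remask = max(1, int(num_actions * (1 - (step + 1) / num_steps)))
        worst_actions = act_conf.topk(n_remask, largest=False).indices

        # Token-level remasking: inside those actions, remask the least-confident tokens.
        token_view = actions.view(1, num_actions, tokens_per_action)
        conf_view = conf.view(1, num_actions, tokens_per_action)
        for a in worst_actions[0]:
            k = tokens_per_action // 2                   # illustrative fraction
            worst_tokens = conf_view[0, a].topk(k, largest=False).indices
            token_view[0, a, worst_tokens] = mask_id
        actions = token_view.view(1, action_len)

    return actions

The per-action mean confidence used for action-level remasking and the within-action fraction are design choices of this sketch; any confidence aggregation and remasking schedule could be substituted.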
Performance under varying data scales. LLaDA-VLA consistently outperforms previous methods on both the SimplerEnv and CALVIN benchmarks. In real-robot experiments, LLaDA-VLA achieves better performance than both CogACT and π₀.
Performance on unseen tasks. LLaDA-VLA exhibits strong generalization on real-robot out-of-distribution (OOD) tasks, handling unseen objects, containers, and distractors, and achieving notably higher success rates than π₀.
@misc{wen2025lladavlavisionlanguagediffusion,
title={LLaDA-VLA: Vision Language Diffusion Action Models},
author={Yuqing Wen and Hebei Li and Kefan Gu and Yucheng Zhao and Tiancai Wang and Xiaoyan Sun},
year={2025},
eprint={2509.06932},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2509.06932},
}
Feel free to contact us at wenyuqing@mail.ustc.edu.cn or wangtiancai@megvii.com.