LLaDA-VLA: Vision Language Diffusion Action Models

Yuqing Wen1*,    Hebei Li1*,    Kefan Gu2*,    Yucheng Zhao3†,    Tiancai Wang3,    Xiaoyan Sun1‡
1University of Science and Technology of China,    2Nanjing University,    3Dexmal
*This work was done during the internship at Dexmal.    †Project lead.    ‡Corresponding Author.

We propose the first Vision-Language-Diffusion-Action model built upon a pretrained diffusion Vision-Language Model. It achieves state-of-the-art performance in both simulation and real-world settings and demonstrates strong generalization to unseen tasks.

Method

LLaDA-VLA Architecture

Overview of LLaDA-VLA. (a) Overall architecture. Visual features extracted by the vision encoder are projected into the text space and concatenated with the text tokens. Together with masked tokens, they are fed into a large language diffusion model, which generates action sequences via Localized Special-Token Classification and further refines them with Hierarchical Action-Structured Decoding. (b) Hierarchical Action-Structured Decoding strategy. Starting from a fully masked action sequence (with the vision and text prompts left unmasked), the model iteratively predicts the masked tokens, applying action-level and token-level remasking based on prediction confidence until the full sequence is decoded.
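Below is a minimal PyTorch sketch of how such confidence-based hierarchical remasking could look. The model interface (a callable returning per-token logits over the concatenated prompt and action tokens), the mask id, the tokens-per-action count, and the thresholds are illustrative assumptions, not the released implementation.

    # Minimal sketch of Hierarchical Action-Structured Decoding (assumptions marked below).
    import torch

    MASK_ID = 0            # hypothetical id of the [MASK] token
    TOKENS_PER_ACTION = 7  # assumed: one discretized token per action dimension

    @torch.no_grad()
    def hierarchical_decode(model, prompt_ids, num_actions, num_steps=8,
                            actions_kept_per_step=1, token_thresh=0.9):
        """Iteratively fill a fully masked action sequence.

        Each step predicts all masked positions; the whole actions with the
        highest mean confidence are committed first (action-level remasking),
        and among the remaining actions only individually confident tokens are
        kept (token-level remasking). Everything else stays masked for the
        next step.
        """
        device = prompt_ids.device
        seq = torch.full((num_actions * TOKENS_PER_ACTION,), MASK_ID, device=device)

        for _ in range(num_steps):
            masked = seq == MASK_ID
            if not masked.any():
                break

            # One forward pass over [prompt ; action tokens] (interface assumed).
            logits = model(torch.cat([prompt_ids, seq]).unsqueeze(0))[0, len(prompt_ids):]
            conf, pred = logits.softmax(-1).max(-1)

            # Action-level remasking: commit the most confident whole actions.
            action_conf = conf.view(num_actions, TOKENS_PER_ACTION).mean(-1)
            action_conf[~masked.view(num_actions, -1).any(-1)] = -1.0  # already decoded
            k = min(actions_kept_per_step, int((action_conf > -1).sum()))
            keep = torch.zeros_like(masked)
            for a in action_conf.topk(k).indices.tolist():
                keep[a * TOKENS_PER_ACTION:(a + 1) * TOKENS_PER_ACTION] = True

            # Token-level remasking: elsewhere, keep individually confident tokens.
            keep |= conf > token_thresh

            commit = masked & keep
            seq[commit] = pred[commit]

        return seq.view(num_actions, TOKENS_PER_ACTION)

In this sketch, committing whole actions first keeps each partially decoded action internally consistent, while the per-token confidence threshold lets reliable tokens in other actions be fixed early; the exact scheduling and thresholds in LLaDA-VLA may differ.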

State-Of-The-Art Performance

Figures: performance on SimplerEnv, performance on CALVIN, and real-world performance.

Performance under varying data scales. LLaDA-VLA consistently outperforms previous methods on both the SimplerEnv and CALVIN benchmarks. In real-robot experiments, LLaDA-VLA achieves better performance than both CogACT and π₀.

Generalization Ability

Generalization to Unseen Tasks

Performance on unseen tasks. LLaDA-VLA exhibits strong generalization on real-robot out-of-distribution (OOD) tasks, handling unseen objects, containers, and distractors, and achieving notably higher success rates than π₀.

BibTeX

@misc{wen2025lladavlavisionlanguagediffusion,
  title={LLaDA-VLA: Vision Language Diffusion Action Models},
  author={Yuqing Wen and Hebei Li and Kefan Gu and Yucheng Zhao and Tiancai Wang and Xiaoyan Sun},
  year={2025},
  eprint={2509.06932},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.06932},
}

Contact

Feel free to contact us at wenyuqing@mail.ustc.edu.cn or wangtiancai@megvii.com.
