We propose the first Vision-Language-Diffusion-Action model built upon a pretrained diffusion-based Vision-Language Model. It achieves state-of-the-art performance in both simulation and real-world settings and demonstrates strong generalization to unseen tasks.
Overview of LLaDA-VLA. (a) Overall architecture. Visual features extracted by the vision encoder are projected into the text embedding space and concatenated with the text tokens. Together with masked tokens, they are fed into a large language diffusion model, which generates action sequences via Localized Special-Token Classification and refines them with Hierarchical Action-Structured Decoding. (b) Hierarchical Action-Structured Decoding. Starting from a fully masked action sequence (with the vision and text prompts left unmasked), the model iteratively predicts the masked tokens, remasking at the action level and the token level based on confidence until the full sequence is decoded.
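The decoding loop in (b) can be sketched in code. This is a minimal illustration under stated assumptions, not the released implementation: the HF-style model interface (model(seq).logits), the fixed step schedule, the hierarchical_decode / num_steps / tokens_per_action names, and the half-remask heuristic are all hypothetical; only the overall pattern (start from fully masked actions, predict, then remask low-confidence actions and their low-confidence tokens) follows the caption.

import torch

@torch.no_grad()
def hierarchical_decode(model, prompt_ids, num_actions, tokens_per_action,
                        mask_id, num_steps=8):
    """Sketch of confidence-based hierarchical decoding (assumed interface)."""
    device = prompt_ids.device
    action_len = num_actions * tokens_per_action
    # Start from a fully masked action sequence; the prompt stays unmasked.
    actions = torch.full((1, action_len), mask_id, dtype=torch.long, device=device)

    for step in range(num_steps):
        seq = torch.cat([prompt_ids, actions], dim=1)
        # Assumption: the model returns per-position logits over the vocabulary.
        logits = model(seq).logits[:, -action_len:]
        conf, pred = logits.softmax(-1).max(-1)          # per-token confidence

        # Commit predictions only at positions that are still masked.
        masked = actions == mask_id
        actions = torch.where(masked, pred, actions)
        conf = torch.where(masked, conf, torch.ones_like(conf))

        if step == num_steps - 1:
            break  # last step: keep the fully decoded sequence

        # Action-level remasking: pick the least-confident actions (schedule is illustrative).
        act_conf = conf.view(1, num_actions, tokens_per_action).mean(-1)
        n_remask = max(1, int(num_actions * (1 - (step + 1) / num_steps)))
        worst_actions = act_conf.topk(n_remask, largest=False).indices

        # Token-level remasking: inside those actions, remask the least-confident tokens.
        token_view = actions.view(1, num_actions, tokens_per_action)
        conf_view = conf.view(1, num_actions, tokens_per_action)
        for a in worst_actions[0]:
            k = tokens_per_action // 2                   # illustrative fraction
            worst_tokens = conf_view[0, a].topk(k, largest=False).indices
            token_view[0, a, worst_tokens] = mask_id
        actions = token_view.view(1, action_len)

    return actions

The per-action mean confidence used for action-level remasking and the within-action fraction are design choices of this sketch; any confidence aggregation and remasking schedule could be substituted.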
Performance under varying data scales. LLaDA-VLA consistently outperforms previous methods on both the SimplerEnv and CALVIN benchmarks. In real-robot experiments, LLaDA-VLA achieves better performance than both CogACT and π₀.
Performance on unseen tasks. LLaDA-VLA exhibits strong generalization on real-robot out-of-distribution (OOD) tasks, handling unseen objects, containers, and distractors, and achieving notably higher success rates than π₀.
@misc{wen2025lladavlavisionlanguagediffusion,
title={LLaDA-VLA: Vision Language Diffusion Action Models},
author={Yuqing Wen and Hebei Li and Kefan Gu and Yucheng Zhao and Tiancai Wang and Xiaoyan Sun},
year={2025},
eprint={2509.06932},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2509.06932},
}
Feel free to contact us at wenyuqing@mail.ustc.edu.cn or wangtiancai@megvii.com.