ResVLA: From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

Abstract

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.

From Noise to Intent: A Paradigm Shift

(a) As illustrated on the left, existing continuous generative policies typically adopt a "Generation-from-Noise" approach. Generation starts from pure noise, an uninformative isotropic Gaussian distribution that carries no semantic prior. Because this initial state is independent of the task instruction, the model must reconstruct the global intent from scratch, which makes the transport path long. This blind exploration is computationally inefficient and leaves optimization prone to "semantic drift" (a phenomenon known as "Loss Collapse"), in which the model fails to align with fine-grained language instructions and generates trajectories that miss the target action manifold.

(b) To overcome these limitations, we propose a "Refinement-from-Intent" paradigm, shown on the right. Instead of generating actions from nothing, ResVLA starts from a condition-dependent intent anchor: a deterministic, low-frequency global trajectory structure predicted directly by the VLM. The anchor supplies a strong semantic prior before generation begins, fixing the global spatial and semantic constraints. A "Residual Diffusion Bridge" is then established from this anchor, so the generative model refines only the residual dynamics (i.e., high-frequency execution details such as contact adjustments). By turning generation into a short-path refinement process, ResVLA improves inference efficiency, accelerates training convergence, and mitigates semantic drift by construction.
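The split into a low-frequency anchor and a high-frequency residual can be illustrated with a small sketch. The text above says only that spectral analysis is used; the choice of an orthonormal DCT, the cutoff `keep`, and the helper names `dct`, `idct`, and `split_trajectory` are assumptions made for this example and are not taken from ResVLA.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence (assumed transform, for illustration)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def split_trajectory(x, keep):
    """Split x into a low-frequency anchor and a high-frequency residual.

    `keep` is a hypothetical cutoff: the number of lowest-frequency
    coefficients retained in the anchor.
    """
    coeffs = dct(x)
    low = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    anchor = idct(low)
    residual = [xi - ai for xi, ai in zip(x, anchor)]
    return anchor, residual

# Toy single-joint action chunk: a smooth reach plus a small fast oscillation.
steps = 16
trajectory = [i / (steps - 1) + 0.02 * (-1) ** i for i in range(steps)]
anchor, residual = split_trajectory(trajectory, keep=4)
```

By construction the anchor and residual sum back to the original trajectory, so a model that predicts the anchor deterministically only has to generate the residual.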

Method

Overview of the ResVLA framework. The architecture consists of two cascaded stages: (1) Intent Anchoring: The Intent Anchoring Module uses VLM features to regress the low-frequency component x_S, constructing a condition-dependent source distribution p₀(x|c). (2) Residual Bridging: A flow matching expert learns the residual transport path (red arrow) from this anchor to the full action x_gt, focusing on high-frequency dynamics. ResVLA thus recasts action generation as a hierarchical bridging problem: first anchor the low-frequency semantic intent, then refine the high-frequency execution dynamics via flow matching.
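A minimal sketch of how a training pair for the residual bridge might be built is given below. It assumes the straight-line interpolation path commonly used in flow matching, with the anchor x_S at t = 0 and the full action x_gt at t = 1; the caption does not specify the path or any added noise, and the names `bridge_sample` and `fm_loss` are hypothetical.

```python
def bridge_sample(anchor, target, t):
    """Return a point on the assumed straight path from anchor (t=0) to
    target (t=1), together with the velocity the expert should regress."""
    x_t = [(1.0 - t) * a + t * g for a, g in zip(anchor, target)]
    # On a straight path the velocity is constant and equals the residual.
    velocity = [g - a for a, g in zip(anchor, target)]
    return x_t, velocity

def fm_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - q) ** 2 for p, q in zip(pred_velocity, target_velocity)) / n

# Toy 2-D example: x_S is the anchor, x_gt the ground-truth action.
x_s = [0.0, 1.0]
x_gt = [1.0, 3.0]
x_t, v_target = bridge_sample(x_s, x_gt, t=0.25)
loss = fm_loss([0.0, 0.0], v_target)  # loss of an untrained, all-zero prediction
```

Under this assumption the regression target is exactly the residual x_gt - x_S, which is small whenever the anchor already captures the global motion.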

LIBERO-Plus Metrics

Robustness evaluation on the LIBERO-Plus benchmark. We report success rates (%) under various perturbations. The best, second-best, and third-best results are highlighted. For ResVLA, we report the performance gain (↑) or loss (↓) relative to the best-performing baseline.

SimplerEnv Metrics

Performance comparison on SimplerEnv (Google Robot). We evaluate four tasks: Pick Coke Can (PC), Move Near (MN), Open Drawer (OD), and Open Top Drawer (OTD). The best, second-best, and third-best results are highlighted (excluding baselines with spatial co-training). ResVLA, trained entirely from scratch, achieves competitive performance.

Performance comparison on SimplerEnv (WidowX/Bridge). We evaluate four tasks: Spoon on Towel (ST), Carrot on Plate (CP), Stack Blocks (SB), and Eggplant in Basket (EB). The best, second-best, and third-best results are highlighted. ResVLA achieves competitive performance, particularly on the Eggplant in Basket task, despite lacking large-scale pre-training.

Efficiency Results

(a) Training convergence curves comparing success rates under varying dropout rates (p). (b) Inference analysis displaying success rates (bars) and inference time (lines) across different numbers of function evaluations (NFE).
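To make the NFE axis concrete, the sketch below integrates a velocity field from the anchor with a fixed-step Euler solver, where each step costs one function evaluation. The Euler solver, the function name `euler_refine`, and the oracle velocity field are assumptions for illustration; the caption does not state which solver ResVLA uses.

```python
def euler_refine(anchor, velocity_fn, nfe):
    """Integrate from t=0 to t=1 in `nfe` Euler steps, starting at the anchor.

    `velocity_fn(x, t)` stands in for the learned flow matching expert and is
    called exactly once per step, so `nfe` equals the number of model calls.
    """
    if nfe < 1:
        raise ValueError("nfe must be at least 1")
    x = list(anchor)
    dt = 1.0 / nfe
    for step in range(nfe):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Oracle field for a straight path: constant velocity equal to the residual.
x_s = [0.0, 1.0]
x_gt = [1.0, 3.0]

def oracle(x, t):
    return [g - a for a, g in zip(x_s, x_gt)]

refined = euler_refine(x_s, oracle, nfe=2)
```

With a perfectly straight path a single step would already reach the target; a learned field is only approximately straight, which is one reason success rate can vary with NFE.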

BibTeX

@misc{zhong2026noiseintentanchoringgenerative,
      title={From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges},
      author={Yiming Zhong and Yaoyu He and Zemin Yang and Pengfei Tian and Yifan Huang and Qingqiu Huang and Xinge Zhu and Yuexin Ma},
      year={2026},
      eprint={2604.21391},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2604.21391},
}
