Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion

Accepted by IEEE Robotics and Automation Letters (RAL) 2026
1The Chinese University of Hong Kong, 2Istituto Italiano di Tecnologia

We propose VLA-Pilot, a novel plug-and-play inference-time policy steering method that enables zero-shot deployment of pre-trained VLA policies without any additional fine-tuning or data collection.

Abstract

Vision-Language-Action (VLA) models have demonstrated significant potential in real-world robotic manipulation. However, pre-trained VLA policies still suffer from substantial performance degradation during downstream deployment. Although fine-tuning can mitigate this issue, its reliance on costly demonstration collection and intensive computation makes it impractical in real-world settings. In this work, we introduce VLA-Pilot, a plug-and-play inference-time policy steering method for zero-shot deployment of pre-trained VLA policies without any additional fine-tuning or data collection. We evaluate VLA-Pilot on six real-world downstream manipulation tasks across two distinct robotic embodiments, encompassing both in-distribution and out-of-distribution scenarios. Experimental results demonstrate that VLA-Pilot substantially boosts the success rates of off-the-shelf pre-trained VLA policies, enabling robust zero-shot generalization to diverse tasks and embodiments.

VLA-Pilot Overview

Insights

A common approach to mitigate deployment failures of pre-trained VLA models is fine-tuning with task-specific data. While effective, this strategy is impractical in real-world applications due to the high cost of data collection and computational resources, as well as the risk of compromising the generalist capabilities of the pre-trained policies. In fact, such deployment failures do not necessarily indicate that the pre-trained VLA policy is incapable of generating the correct behavior. The desired behavior mode may already exist within the policy's generative distribution, but due to suboptimal mode selection at runtime, it fails to be executed reliably.

Insights figure
Figure: Key insights behind VLA-Pilot — pre-trained VLA action proposals cluster on a meaningful manifold, and LLM-derived rewards provide training-free task alignment.

Method Overview

Given a task context, VLA-Pilot steers a pre-trained VLA policy at inference time through three tightly integrated steps:

① Steering Objective Reasoning (EPS-CoT). The Embodied Policy Steering Chain-of-Thought (EPS-CoT) module takes the task language instruction and current visual observation as input. It reasons step-by-step to produce a structured, task-aligned reward function that encodes what "good" action execution looks like for the given task — without any human annotation or demonstration.
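To make the idea of a structured, task-aligned reward concrete, here is a hypothetical example of the kind of function EPS-CoT might emit for a placement instruction. All names, weights, and the keypoint-distance formulation are our own illustrative assumptions, not the paper's actual output format:

```python
import math

# Hypothetical EPS-CoT output for an instruction like "place the cup
# on the plate": score an end-effector trajectory by its final distance
# to a target keypoint, plus a small smoothness penalty.
def task_reward(trajectory, target_xy, w_dist=1.0, w_smooth=0.1):
    """trajectory: list of (x, y) waypoints; target_xy: (x, y) keypoint."""
    final = trajectory[-1]
    dist = math.hypot(final[0] - target_xy[0], final[1] - target_xy[1])
    # Penalize jerky motion via the total path length between waypoints.
    smooth = sum(
        math.hypot(b[0] - a[0], b[1] - a[1])
        for a, b in zip(trajectory, trajectory[1:])
    )
    return -(w_dist * dist + w_smooth * smooth)

# A trajectory ending on the target keypoint scores higher than one
# that ends away from it.
on_target = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
off_target = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
```

Any proposal sampled from the frozen policy can then be ranked with such a reward, which is all the downstream evolutionary optimization needs.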

② Action Proposal Optimization (Evolutionary Diffusion). A population of candidate action proposals is sampled from the pre-trained VLA's diffusion process. Each proposal is scored by the EPS-CoT reward, and an evolutionary selection-and-recombination procedure iteratively refines the population over multiple generations, ultimately executing the highest-scoring action.

③ Iterative Steering Refinement. After each execution step, the EPS-CoT module receives post-execution feedback (visual state, task progress) and updates the steering objective accordingly. This closed-loop refinement allows VLA-Pilot to recover from errors and adapt to task-specific dynamics over a full manipulation episode.
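The steering loop in step ② can be sketched as a small select-and-recombine routine. This is a toy approximation under our own assumptions: the policy is a black-box sampler, the reward is the EPS-CoT function, and Gaussian jitter of elites stands in for the paper's diffusion-based recombination operator:

```python
import random

def evolutionary_steering(sample_action, reward, pop_size=16,
                          elite_frac=0.25, generations=3, sigma=0.05,
                          rng=random):
    """Toy inference-time steering loop (illustrative names).

    sample_action() draws one action (list of floats) from the frozen
    diffusion policy; reward(action) scores it. Top-scoring elites are
    kept each generation and perturbed with Gaussian noise as a proxy
    for diffusion-based recombination.
    """
    population = [sample_action() for _ in range(pop_size)]
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):
        population.sort(key=reward, reverse=True)
        elites = population[:n_elite]
        # Refill the population by jittering randomly chosen elites.
        population = elites + [
            [x + rng.gauss(0.0, sigma) for x in rng.choice(elites)]
            for _ in range(pop_size - n_elite)
        ]
    # Execute the highest-scoring candidate.
    return max(population, key=reward)

# Toy setup: the "policy" samples 2-D actions near the origin, while
# the task reward prefers actions near (1, 1).
rng = random.Random(0)
best = evolutionary_steering(
    sample_action=lambda: [rng.gauss(0, 0.3), rng.gauss(0, 0.3)],
    reward=lambda a: -((a[0] - 1.0) ** 2 + (a[1] - 1.0) ** 2),
    rng=rng,
)
```

Because elites are always retained, the returned action's reward never decreases across generations, which is the key property the steering step relies on.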

VLA-Pilot Method Overview
Figure: VLA-Pilot framework. EPS-CoT reasons a task-aligned reward; Evolutionary Diffusion optimizes action proposals; iterative refinement closes the loop.

Qualitative Results

VLA-Pilot effectively steers off-the-shelf pre-trained VLA policies to complete downstream tasks at inference time, achieving zero-shot deployment across both ID and OOD task scenarios.

Qualitative results figure

Quantitative Results

VLA-Pilot outperforms all baselines in steering pre-trained VLA policies for downstream task execution. Specifically, VLA-Pilot consistently enhances the performance of both pre-trained VLA policies across all six downstream tasks, achieving average improvements of +0.31 mean success rate (MSR) for DiVLA and +0.30 MSR for RDT-1B.

Quantitative results figure

Conclusion

In this paper, we presented VLA-Pilot, an inference-time policy steering method that enables zero-shot deployment of pre-trained VLA models without any fine-tuning. Both simulation and real-world experiments validate its effectiveness and highlight its potential as a universal and modular plug-in for aligning generalist VLA policies with diverse downstream task goals.

Limitations

Despite its effectiveness, VLA-Pilot has several limitations:

  • Diffusion-based architecture dependency. VLA-Pilot assumes that the underlying VLA policy supports noise-conditioned sampling, limiting its applicability to diffusion-based architectures.
  • Inference-time latency. The reliance on MLLMs introduces non-trivial inference-time latency.
  • Keypoint-based reward grounding. The current reward evaluation paradigm depends on keypoint-based vision grounding, which may be brittle in tasks involving delayed effects or deformable object interactions.

Future directions include extending the steering paradigm to broader VLA architectures, optimizing MLLM integration via quantization or caching strategies, and improving robustness by incorporating richer embodied feedback beyond keypoint-level signals.

Q & A

Does VLA-Pilot require any task-specific training or demonstrations?
No. VLA-Pilot is entirely training-free. It operates at inference time on top of any pre-trained VLA policy without requiring new demonstrations, gradient updates, or model retraining. The EPS-CoT reward is reasoned from the task language instruction and visual observation alone.
Which pre-trained VLA models are compatible with VLA-Pilot?
VLA-Pilot is designed to be model-agnostic. In our experiments we evaluated it with DiVLA and RDT-1B. Any VLA policy that (1) uses a diffusion-based action head and (2) accepts language and vision inputs can in principle be plugged into the VLA-Pilot framework.
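As a hedged sketch (the interface name and method signature below are our own assumptions, not the paper's actual API), the requirement in (1)–(2) amounts to the policy exposing a noise-conditioned sampler that the steering loop can call:

```python
from typing import Protocol, Sequence, runtime_checkable

@runtime_checkable
class SteerableVLAPolicy(Protocol):
    # Minimal surface needed by inference-time steering (illustrative):
    # a sampler that returns a population of candidate actions for the
    # current observation and language instruction.
    def sample_actions(self, observation, instruction: str,
                       num_samples: int) -> Sequence[Sequence[float]]: ...

class DummyDiffusionPolicy:
    """Stand-in policy returning constant actions, just to show the shape."""
    def sample_actions(self, observation, instruction, num_samples):
        return [[0.0, 0.0, 0.0]] * num_samples

policy = DummyDiffusionPolicy()
candidates = policy.sample_actions(None, "pick up the cup", 8)
```

Any diffusion-based VLA whose action head can be wrapped to match such a sampling interface could, in principle, be steered without touching its weights.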
How does VLA-Pilot compare to fine-tuning approaches?
On out-of-distribution tasks, VLA-Pilot achieves performance competitive with fine-tuning methods trained on 50 demonstrations, while requiring zero demonstrations and no computation beyond inference. Fine-tuning still holds an advantage on highly specialized tasks with large distribution shifts, but VLA-Pilot dramatically reduces the deployment barrier for new tasks.
What is the computational overhead of Evolutionary Diffusion?
Evolutionary Diffusion requires sampling a population of candidate actions and running multiple scoring iterations per control step. In our experiments, this adds approximately 2–4× the base VLA inference time per step. We believe this overhead can be significantly reduced with parallel hardware execution and smaller population sizes, which we leave to future work.
Can VLA-Pilot handle out-of-distribution (OOD) tasks?
Yes. We specifically evaluate VLA-Pilot on OOD tasks — tasks not seen during VLA pretraining — and observe robust generalization. The EPS-CoT module's language-driven reward reasoning generalizes across task types, and the evolutionary optimization ensures the executed action aligns with the task goal even when the base VLA policy struggles.

BibTeX

@article{li2025towards,
  title={Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion},
  author={Li, Zhuo and Liu, Junjia and Dong, Zhipeng and Teng, Tao and Rouxel, Quentin and Caldwell, Darwin and Chen, Fei},
  journal={arXiv preprint arXiv:2511.14178},
  year={2025}
}