Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion
Abstract
Vision-Language-Action (VLA) models have demonstrated significant potential in real-world robotic manipulation. However, pre-trained VLA policies still suffer from substantial performance degradation during downstream deployment. Although fine-tuning can mitigate this issue, its reliance on costly demonstration collection and intensive computation makes it impractical in real-world settings. In this work, we introduce VLA-Pilot, a plug-and-play inference-time policy steering method for zero-shot deployment of pre-trained VLA policies without any additional fine-tuning or data collection. We evaluate VLA-Pilot on six real-world downstream manipulation tasks across two distinct robotic embodiments, encompassing both in-distribution and out-of-distribution scenarios. Experimental results demonstrate that VLA-Pilot substantially boosts the success rates of off-the-shelf pre-trained VLA policies, enabling robust zero-shot generalization to diverse tasks and embodiments.
Insights
A common approach to mitigate deployment failures of pre-trained VLA models is fine-tuning with task-specific data. While effective, this strategy is impractical in real-world applications due to the high cost of data collection and computational resources, as well as the risk of compromising the generalist capabilities of the pre-trained policies. In fact, such deployment failures do not necessarily indicate that the pre-trained VLA policy is incapable of generating the correct behavior. The desired behavior mode may already exist within the policy's generative distribution, but due to suboptimal mode selection at runtime, it fails to be executed reliably.
Method Overview
Given a task context, VLA-Pilot steers a pre-trained VLA policy at inference time through three tightly integrated steps:
① Steering Objective Reasoning (EPS-CoT). The Embodied Policy Steering Chain-of-Thought (EPS-CoT) module takes the task language instruction and current visual observation as input. It reasons step-by-step to produce a structured, task-aligned reward function that encodes what "good" action execution looks like for the given task — without any human annotation or demonstration.
② Action Proposal Optimization (Evolutionary Diffusion). A population of candidate action proposals is sampled from the pre-trained VLA's diffusion process. Each proposal is scored by the EPS-CoT reward, and an evolutionary selection-and-recombination procedure iteratively refines the population over multiple generations, ultimately executing the highest-scoring action.
③ Iterative Steering Refinement. After each execution step, the EPS-CoT module receives post-execution feedback (visual state, task progress) and updates the steering objective accordingly. This closed-loop refinement allows VLA-Pilot to recover from errors and adapt to task-specific dynamics over a full manipulation episode.
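The three steps above can be sketched as a single loop: sample action proposals, score them with a reward, and evolve the population before executing the best candidate. This is a minimal illustrative sketch, not the authors' implementation: `sample_actions` stands in for the VLA's diffusion sampler, `reward_fn` for the EPS-CoT-generated reward, and the hyperparameters are arbitrary.

```python
# Minimal sketch of evolutionary steering over diffusion-sampled action
# proposals. All names and numbers here are illustrative stand-ins, not
# the authors' actual API or settings.
import numpy as np

rng = np.random.default_rng(0)

def sample_actions(n, dim=7):
    """Stand-in for drawing n action proposals from the VLA's diffusion head."""
    return rng.normal(size=(n, dim))

def reward_fn(action):
    """Stand-in for the EPS-CoT reward: here, prefer actions near a fixed target."""
    target = np.ones(action.shape[-1])
    return -np.linalg.norm(action - target)

def evolve(population, generations=5, elite_frac=0.25, noise=0.1):
    """Score proposals, keep the elites, and resample children around them."""
    for _ in range(generations):
        scores = np.array([reward_fn(a) for a in population])
        k = max(1, int(len(population) * elite_frac))
        elites = population[np.argsort(scores)[-k:]]
        # Recombination/mutation: perturb randomly chosen elites with small noise,
        # loosely mimicking re-noising and re-denoising in the diffusion process.
        idx = rng.integers(0, k, size=len(population) - k)
        children = elites[idx] + rng.normal(scale=noise, size=(len(population) - k, population.shape[1]))
        population = np.vstack([elites, children])
    scores = np.array([reward_fn(a) for a in population])
    return population[np.argmax(scores)]  # highest-scoring action is executed

best = evolve(sample_actions(32))
```

Because elites are carried over unchanged between generations, the best score never decreases; in the full method, the executed action would then trigger the feedback step that updates the steering objective.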
Qualitative Results
VLA-Pilot effectively steers off-the-shelf pre-trained VLA policies to complete downstream tasks at inference time, achieving zero-shot deployment across both ID and OOD task scenarios.
Quantitative Results
VLA-Pilot outperforms all baselines in steering pre-trained VLA policies for downstream task execution. Specifically, VLA-Pilot consistently enhances the performance of both pre-trained VLA policies across all six downstream tasks, achieving average improvements of +0.31 MSR for DiVLA and +0.30 MSR for RDT-1B.
Conclusion
In this paper, we presented VLA-Pilot, an inference-time policy steering method that enables zero-shot deployment of pre-trained VLA models without any fine-tuning. Both simulation and real-world experiments validate its effectiveness and highlight its potential as a universal and modular plug-in for aligning generalist VLA policies with diverse downstream task goals.
Limitations
Despite its effectiveness, VLA-Pilot has several limitations:
- Diffusion-based architecture dependency. VLA-Pilot assumes that the underlying VLA policy supports noise-conditioned sampling, limiting its applicability to diffusion-based architectures.
- Inference-time latency. The reliance on MLLMs introduces non-trivial inference-time latency.
- Keypoint-based reward grounding. The current reward evaluation paradigm depends on keypoint-based vision grounding, which may be brittle in tasks involving delayed effects or deformable object interactions.
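To make the keypoint-grounding limitation concrete, a keypoint-based reward typically scores actions from a few tracked 3-D points. The sketch below is a hypothetical example (the function and its stage logic are assumptions, not the paper's reward): it reduces the scene to single keypoints, which is exactly why delayed effects or deformable objects (whose state a single point cannot summarize) are hard to evaluate.

```python
# Hypothetical keypoint-grounded reward for a pick-and-place-style task.
# A single keypoint per entity cannot capture deformable-object state,
# which illustrates the brittleness noted above.
import numpy as np

def keypoint_reward(gripper_kp, object_kp, goal_kp, grasped):
    """Before grasping, reward closing the gripper-object gap;
    after grasping, reward moving the object toward the goal keypoint.
    All keypoints are 3-D positions from a vision-grounding module."""
    if not grasped:
        return -np.linalg.norm(gripper_kp - object_kp)
    return -np.linalg.norm(object_kp - goal_kp)
```

Richer embodied feedback (e.g., dense tracking or tactile signals) would replace these point distances with state estimates that survive deformation.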
Future directions include extending the steering paradigm to broader VLA architectures, optimizing MLLM integration via quantization or caching strategies, and improving robustness by incorporating richer embodied feedback beyond keypoint-level signals.
Q & A
Does VLA-Pilot require any task-specific training or demonstrations?
Which pre-trained VLA models are compatible with VLA-Pilot?
How does VLA-Pilot compare to fine-tuning approaches?
What is the computational overhead of Evolutionary Diffusion?
Can VLA-Pilot handle out-of-distribution (OOD) tasks?
BibTeX
@article{li2025towards,
title={Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion},
author={Li, Zhuo and Liu, Junjia and Dong, Zhipeng and Teng, Tao and Rouxel, Quentin and Caldwell, Darwin and Chen, Fei},
journal={arXiv preprint arXiv:2511.14178},
year={2025}
}