AnyCamVLA:
Zero-Shot Camera Adaptation for Viewpoint
Robust Vision-Language-Action Models
Despite remarkable progress in Vision-Language-Action models (VLAs) for robot manipulation, these large pre-trained models typically require fine-tuning before they can be deployed in a specific environment.
The resulting fine-tuned models are highly sensitive to the camera viewpoint changes that frequently occur in unstructured environments.
In this paper, we propose a zero-shot camera adaptation framework that requires no additional demonstration data, policy fine-tuning, or architectural modification.
Our key idea is to virtually adjust test-time camera observations to match the training camera configuration in real time.
To this end, we use a recent feed-forward novel view synthesis model that renders high-quality target-view images, handling changes in both extrinsic and intrinsic parameters.
This plug-and-play approach preserves the pre-trained capabilities of VLAs and applies to any RGB-based policy.
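Conceptually, the adaptation acts as a thin wrapper around a frozen policy: each incoming frame is re-rendered at the training viewpoint before the policy is queried. The sketch below illustrates this idea under stated assumptions; all names (CameraParams, NovelViewSynthesizer, ViewpointAdaptedPolicy) are hypothetical placeholders, not our actual interfaces or the API of any specific synthesis model.

```python
# Illustrative sketch only: a plug-and-play wrapper that re-renders each test-time
# observation at the training camera configuration before querying a frozen,
# RGB-based VLA policy. Names and signatures are hypothetical.

from dataclasses import dataclass
from typing import Callable, Protocol

import numpy as np


@dataclass
class CameraParams:
    """Pinhole camera model: 3x3 intrinsics K and 4x4 world-to-camera extrinsics."""
    intrinsics: np.ndarray  # shape (3, 3)
    extrinsics: np.ndarray  # shape (4, 4)


class NovelViewSynthesizer(Protocol):
    """Feed-forward novel view synthesis: re-renders the scene seen from
    `source_cam` as it would appear from `target_cam`."""
    def __call__(self, image: np.ndarray,
                 source_cam: CameraParams,
                 target_cam: CameraParams) -> np.ndarray: ...


class ViewpointAdaptedPolicy:
    """Wraps any RGB-based policy; the policy itself is never fine-tuned."""

    def __init__(self,
                 policy: Callable[[np.ndarray, str], np.ndarray],
                 synthesizer: NovelViewSynthesizer,
                 train_cam: CameraParams):
        self.policy = policy            # frozen pre-trained VLA
        self.synthesizer = synthesizer  # feed-forward novel view synthesis model
        self.train_cam = train_cam      # camera configuration used during training

    def act(self, image: np.ndarray, test_cam: CameraParams,
            instruction: str) -> np.ndarray:
        # 1. Virtually move the test-time observation back to the training view,
        #    compensating for mismatch in both extrinsics and intrinsics.
        adapted = self.synthesizer(image,
                                   source_cam=test_cam,
                                   target_cam=self.train_cam)
        # 2. Query the unmodified policy on the viewpoint-adapted observation.
        return self.policy(adapted, instruction)
```

Because the wrapper only transforms observations, the same adapted policy object could, in principle, be driven by a fixed but shifted camera or by a freely moving handheld camera whose pose is tracked per frame.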
In extensive experiments on the LIBERO benchmark, our method consistently outperforms baselines that rely on data augmentation during policy fine-tuning or on additional 3D-aware visual features.
We further validate that our approach consistently enhances viewpoint robustness in real-world robotic manipulation scenarios, including settings with varying camera extrinsics, varying intrinsics, and freely moving handheld cameras.
Detailed method descriptions, experimental results, demo videos, and the code release will be available soon. Stay tuned!