Abstract
Vision-Language-Action (VLA) models often struggle to generalize across camera viewpoints because inferring robust 3D geometry from 2D images is difficult. GeoAware-VLA improves viewpoint invariance by pairing a frozen, pretrained geometric vision model (VGGT), used as the feature extractor, with a trainable projection layer that adapts its features for a BAKU-style policy decoder. Evaluated on LIBERO subsets, the method more than doubles zero-shot success rates on novel viewpoints in simulation and shows significant gains on a real robot, for both continuous and discrete action spaces.
Architecture
Frozen VGGT features pass through a trainable projection layer and a transformer observation trunk before reaching the action head.
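The sketch below illustrates this pipeline in PyTorch under stated assumptions: the VGGT backbone is treated as an opaque frozen module, and the feature width, trunk depth, and action dimension are placeholder values, not the paper's actual configuration. The BAKU-style decoder is stood in for by a generic transformer encoder trunk; only the frozen-backbone, trainable-projection pattern is the point.

```python
# Minimal sketch of the GeoAware-VLA pipeline described above.
# Assumptions (not from the paper): feature/embedding widths, trunk depth,
# and a generic transformer trunk standing in for the BAKU-style decoder.
import torch
import torch.nn as nn


class GeoAwarePolicySketch(nn.Module):
    def __init__(self, vggt_backbone: nn.Module,
                 feat_dim: int = 1024,   # assumed VGGT feature width
                 embed_dim: int = 512,   # assumed trunk width
                 action_dim: int = 7):   # e.g. 6-DoF pose + gripper (assumed)
        super().__init__()
        # Frozen geometric feature extractor (pretrained VGGT).
        self.backbone = vggt_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False

        # Trainable projection that adapts geometric features for the policy.
        self.proj = nn.Linear(feat_dim, embed_dim)

        # Transformer observation trunk (stand-in for the BAKU-style decoder).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)

        # Continuous action head; a discrete variant would emit token logits.
        self.action_head = nn.Linear(embed_dim, action_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, T, C, H, W) observation frames.
        with torch.no_grad():                   # backbone stays frozen
            feats = self.backbone(images)       # (B, T, feat_dim), assumed shape
        tokens = self.proj(feats)               # adapt features to trunk width
        tokens = self.trunk(tokens)             # fuse observations over time
        return self.action_head(tokens[:, -1])  # predict the next action
```

In this sketch only the projection, trunk, and action head receive gradients, which mirrors the stated design of keeping the geometric backbone frozen while training a lightweight adapter and policy on top.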
Quantitative Results
GeoAware variants improve zero-shot success on novel viewpoints across LIBERO subsets.
Poster / PDF
BibTeX
@article{abouzeid2025geoaware,
  title={GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model},
  author={Abouzeid, Ali and Mansour, Malak and Sun, Zezhou and Song, Dezhen},
  journal={arXiv preprint arXiv:2509.14117},
  year={2025},
  url={https://arxiv.org/abs/2509.14117}
}