GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model

Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
arXiv, 2025

GeoAware-VLA injects geometric priors via a frozen VGGT backbone and a lightweight projection layer, enabling robust zero-shot generalization to novel viewpoints.

Abstract

Vision-Language-Action (VLA) models often struggle to generalize across camera viewpoints because inferring robust 3D geometry from 2D images is difficult. GeoAware-VLA improves viewpoint invariance by pairing a frozen, pretrained geometric vision model (VGGT), used as a feature extractor, with a trainable projection layer that adapts its features for a BAKU-style policy decoder. Evaluated on LIBERO subsets, the method more than doubles zero-shot success rates on novel viewpoints in simulation and yields significant gains on a real robot, in both continuous and discrete action spaces.

Architecture

GeoAware-VLA architecture: VGGT backbone → feature projection → transformer observation trunk → action head

Frozen VGGT features feed a projection layer and a transformer observation trunk before the action head.
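The data flow can be summarized in a short sketch. The module below is a minimal, hypothetical PyTorch rendering of the described pipeline (frozen VGGT backbone, trainable projection, transformer observation trunk, action head); the class name, feature dimensions, language-token handling, and the vggt_backbone interface are illustrative assumptions rather than the authors' released code.

# Minimal sketch of the GeoAware-VLA forward pass (illustrative only).
# Assumes a `vggt_backbone` callable that maps images to patch-level
# geometric features; dimensions and module names are placeholders.
import torch
import torch.nn as nn

class GeoAwarePolicySketch(nn.Module):
    def __init__(self, vggt_backbone, geo_dim=1024, embed_dim=512,
                 action_dim=7, num_layers=4, num_heads=8):
        super().__init__()
        self.vggt = vggt_backbone
        for p in self.vggt.parameters():          # keep the geometric prior frozen
            p.requires_grad = False

        # Trainable projection that adapts VGGT features to the policy width
        self.project = nn.Linear(geo_dim, embed_dim)

        # Transformer observation trunk (BAKU-style stand-in)
        trunk_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(trunk_layer, num_layers=num_layers)

        # Continuous action head; a discrete variant would end in logits
        self.action_head = nn.Linear(embed_dim, action_dim)

    def forward(self, images, lang_tokens):
        with torch.no_grad():                      # frozen feature extraction
            geo_feats = self.vggt(images)          # (B, N_patches, geo_dim)
        obs_tokens = self.project(geo_feats)       # (B, N_patches, embed_dim)
        tokens = torch.cat([lang_tokens, obs_tokens], dim=1)
        trunk_out = self.trunk(tokens)
        # Pool trunk output and predict the next action
        return self.action_head(trunk_out.mean(dim=1))

Only the projection, trunk, and action head receive gradients during training, mirroring the frozen-backbone design described above.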

Results & Tasks

Composite figure: success-rate bar chart across LIBERO tasks alongside example rollout frames.

Quantitative Results

Table: Success rates (%) on LIBERO subsets comparing baselines vs. GeoAware variants

GeoAware variants improve zero-shot success on novel viewpoints across LIBERO subsets.
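As a rough illustration of the evaluation protocol behind these numbers, the function below sketches how zero-shot viewpoint success rates are typically computed: the policy is rolled out from camera poses withheld during training, and the fraction of successful episodes is reported. The make_env and policy.act interfaces and the "success" info key are hypothetical placeholders, not the paper's evaluation harness.

# Hedged sketch of zero-shot novel-viewpoint evaluation.
# `make_env`, `policy.act`, and the "success" info key are illustrative
# assumptions, not the paper's actual evaluation code.
def novel_viewpoint_success_rate(policy, make_env, task, novel_viewpoints,
                                 episodes=50, max_steps=300):
    """Roll out the policy from camera poses withheld during training."""
    successes = 0
    for view in novel_viewpoints:
        for _ in range(episodes):
            env = make_env(task, camera_pose=view)   # held-out viewpoint
            obs = env.reset()
            success = False
            for _ in range(max_steps):
                action = policy.act(obs, task.instruction)
                obs, reward, done, info = env.step(action)
                if done:
                    success = bool(info.get("success", False))
                    break
            successes += int(success)
    # Percentage of successful rollouts across all held-out viewpoints
    return 100.0 * successes / (episodes * len(novel_viewpoints))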


BibTeX

@article{abouzeid2025geoaware,
  title={GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model},
  author={Abouzeid, Ali and Mansour, Malak and Sun, Zezhou and Song, Dezhen},
  journal={arXiv preprint arXiv:2509.14117},
  year={2025},
  url={https://arxiv.org/abs/2509.14117}
}