Abstract:
We introduce a feed-forward neural network that offers a
novel approach to visual geometry reconstruction, breaking
the reliance on a conventional fixed reference view. Previous
methods often anchor their reconstructions to a designated
viewpoint, an inductive bias that can lead to instability and
failures if the reference is suboptimal. In contrast, our model
employs a fully permutation-equivariant architecture to predict
affine-invariant camera poses and scale-invariant local point
maps without any reference frames. This design makes our
model inherently robust to input ordering and highly scalable.
These advantages enable our simple and bias-free approach
to achieve state-of-the-art performance on a wide range of
tasks, including camera pose estimation, monocular/video
depth estimation, and dense point map reconstruction. Code
and models are publicly available.
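
To make the permutation-equivariance property concrete, the following is a minimal sketch and not the model's actual architecture. It assumes a toy layer (the hypothetical `set_layer`) that mixes each view's features with an order-independent mean over all views, and contrasts it with a hypothetical `reference_anchored_layer` that expresses every view relative to the first input. All function names, weights, and dimensions are illustrative assumptions.

```python
import random

def set_layer(views, w_self=0.7, w_pool=0.3):
    """Toy permutation-equivariant layer (illustrative assumption).

    Each output mixes a view's own features with the mean over all views,
    so no single view plays the role of a reference frame.
    """
    n, dim = len(views), len(views[0])
    pooled = [sum(v[d] for v in views) / n for d in range(dim)]
    return [[w_self * v[d] + w_pool * pooled[d] for d in range(dim)] for v in views]

def reference_anchored_layer(views):
    """Toy reference-based layer: every view is expressed relative to view 0."""
    ref = views[0]
    return [[v[d] - ref[d] for d in range(len(ref))] for v in views]

def is_equivariant(layer, views, perm, tol=1e-9):
    """Check layer(P x) == P layer(x) for one permutation P of the view order."""
    out_of_permuted = layer([views[i] for i in perm])
    permuted_out = [layer(views)[i] for i in perm]
    return all(
        abs(a - b) <= tol
        for row_a, row_b in zip(out_of_permuted, permuted_out)
        for a, b in zip(row_a, row_b)
    )

# Five hypothetical views with four features each, and one reordering of them.
rng = random.Random(0)
views = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
perm = [3, 0, 4, 1, 2]

print("set layer equivariant:", is_equivariant(set_layer, views, perm))
print("reference-anchored equivariant:", is_equivariant(reference_anchored_layer, views, perm))
```

In this sketch, reordering the inputs of the set layer only reorders its outputs, whereas the reference-anchored layer produces different values once a different view lands in the first slot, which is the kind of order dependence the abstract attributes to fixed-reference designs.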