WinT3R: Window-Based Streaming Reconstruction With Camera Token Pool

arXiv 2025

Abstract:

We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without heavy computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets.
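
The two ideas named in the abstract, a sliding window over the incoming frame stream and a global pool of compact per-frame camera tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `sliding_windows`, `CameraTokenPool`, and `run_stream`, the parameters `window_size` and `stride`, and the placeholder tuples standing in for learned camera tokens are all assumptions made for illustration. The sketch only shows the bookkeeping: frames are grouped into overlapping windows as they arrive, and each window can see the camera tokens accumulated from earlier windows.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, List, Tuple


def sliding_windows(frames: Iterable[int], window_size: int, stride: int) -> Iterator[List[int]]:
    """Yield overlapping windows from a frame stream as soon as each one is full."""
    if not 1 <= stride <= window_size:
        raise ValueError("stride must be in [1, window_size]")
    buffer: deque = deque(maxlen=window_size)  # oldest frames drop out automatically
    since_last = 0  # frames received since the last emitted window
    for frame in frames:
        buffer.append(frame)
        since_last += 1
        if len(buffer) == window_size and since_last >= stride:
            yield list(buffer)
            since_last = 0


class CameraTokenPool:
    """Hypothetical append-only store holding one compact token per frame."""

    def __init__(self) -> None:
        self._tokens: Dict[int, Tuple[str, int]] = {}

    def add(self, frame_id: int, token: Tuple[str, int]) -> None:
        # Keep the first token stored for a frame; later windows do not overwrite it.
        self._tokens.setdefault(frame_id, token)

    def __len__(self) -> int:
        return len(self._tokens)


def run_stream(frames: Iterable[int], window_size: int = 4, stride: int = 2):
    """Process a stream window by window, recording how much global context each window had."""
    pool = CameraTokenPool()
    log: List[Tuple[List[int], int]] = []
    for window in sliding_windows(frames, window_size, stride):
        context_size = len(pool)  # tokens from earlier windows available to this one
        for frame_id in window:
            pool.add(frame_id, ("cam", frame_id))  # placeholder for a learned camera token
        log.append((window, context_size))
    return log, pool


if __name__ == "__main__":
    log, pool = run_stream(range(6))
    for window, context_size in log:
        print(window, context_size)
```

In this sketch the pool grows by one small token per frame rather than by a full set of image features, which is the sense in which a compact camera representation keeps global context cheap. Frames left over at the end of the stream that do not complete a stride are not emitted; a real system would need a policy for them.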

Haoyi Zhu 朱皓怡
Ph.D. student in Computer Science

My research interests include world models, embodied AI, and spatial intelligence.
