Visual Implicit Geometry Transformer for Autonomous Driving

Arsenii Shirokov, Mikhail Kuznetsov, Danila Stepochkin, Egor Evdokimov, Daniil Glazkov, Nikolay Patakin, Anton Konushin, Dmitry Senushkin

Lomonosov Moscow State University


Abstract

We introduce the Visual Implicit Geometry Transformer (ViGT), an autonomous driving geometric model that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundational models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in bird's-eye view (BEV), addressing domain-specific requirements. ViGT naturally infers geometry from multiple camera views in a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where it achieves performance comparable to that of supervised methods.

Method

Our architecture consists of three main components: (1) an image encoder (ViT-L) that independently processes each image and extracts feature tokens from the last four layers, producing four sequences of tokens per image; (2) a calibration-free Implicit BEV Projection module that projects tokens from each encoder layer across all images to their corresponding BEV space, generating four layer-specific BEV representations, which are then aggregated and upsampled into a single unified BEV representation using DPT; and (3) a query-based Implicit Decoder that predicts occupancy probabilities for 3D points from the final BEV features. This design enables purely data-driven scene modeling without geometric inductive biases.
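The data flow from image tokens to occupancy probabilities can be illustrated with a small NumPy sketch. It is not the released implementation and rests on several assumptions: the projection is modeled as single-head cross-attention from learned BEV queries to image tokens, only one encoder layer is used (the four-layer aggregation with DPT is omitted), and the decoder is a two-layer MLP applied to bilinearly sampled BEV features concatenated with the query height. All function names, shapes, and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project_to_bev(bev_queries, image_tokens):
    """Cross-attention: every BEV cell attends to all image tokens.

    bev_queries:  (H*W, D) latent queries, one per BEV cell (assumed learned)
    image_tokens: (n_cams * n_patches, D) tokens from one encoder layer
    No camera intrinsics or extrinsics enter the computation.
    """
    d = bev_queries.shape[-1]
    attn = softmax(bev_queries @ image_tokens.T / np.sqrt(d), axis=-1)
    return attn @ image_tokens, attn

def sample_bev(bev, xy, roi):
    """Bilinearly sample BEV features (H, W, D) at metric points xy (P, 2)."""
    h, w, _ = bev.shape
    (x_min, x_max), (y_min, y_max) = roi
    u = np.clip((xy[:, 0] - x_min) / (x_max - x_min) * (w - 1), 0, w - 1)
    v = np.clip((xy[:, 1] - y_min) / (y_max - y_min) * (h - 1), 0, h - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, w - 1), np.minimum(v0 + 1, h - 1)
    fu, fv = (u - u0)[:, None], (v - v0)[:, None]
    top = bev[v0, u0] * (1 - fu) + bev[v0, u1] * fu
    bottom = bev[v1, u0] * (1 - fu) + bev[v1, u1] * fu
    return top * (1 - fv) + bottom * fv

def decode_occupancy(bev, points, roi, w1, w2):
    """Hypothetical implicit decoder: MLP over (BEV feature, height z)."""
    feats = sample_bev(bev, points[:, :2], roi)
    x = np.concatenate([feats, points[:, 2:3]], axis=-1)
    hidden = np.maximum(x @ w1, 0.0)           # ReLU
    logits = hidden @ w2
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))  # sigmoid -> probability

# Toy sizes; the real model is far larger.
H, W, D = 8, 8, 16
n_cams, n_patches = 6, 20
tokens = rng.normal(size=(n_cams * n_patches, D))
queries = rng.normal(size=(H * W, D))

bev_flat, attn = project_to_bev(queries, tokens)
bev = bev_flat.reshape(H, W, D)

roi = ((-40.0, 40.0), (-40.0, 40.0))            # assumed RoI in metres
points = rng.uniform(-40.0, 40.0, size=(5, 3))  # arbitrary 3D query points
w1 = rng.normal(size=(D + 1, 32)) * 0.1         # random stand-in weights
w2 = rng.normal(size=(32, 1)) * 0.1
occ = decode_occupancy(bev, points, roi, w1, w2)
```

Because the decoder takes continuous coordinates, the same BEV features can be queried at any resolution, which is what makes the occupancy field continuous rather than tied to a fixed voxel grid.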

Visualizations

The videos and images below follow the same layout: the bottom strip shows camera images used as model input, the right panel shows estimated occupancy, and the left panel shows the LiDAR point cloud for reference.

Videos

Illustrations

Interactive demos

These interactive demos let you explore navigable renders of the model output, scene consistency under limited input, and the inner workings of the attention layers. If a demo feels cramped, use the “Open fullscreen” link in each section. Best viewed on desktop.

3D rendering

Our occupancy model supports querying arbitrary points within the region of interest (RoI) bounds. This demo shows voxel grids (first frame) and point clouds (second frame) produced from the model by subsampling and ray-marching. Ground-truth LiDAR points and other models' predictions are provided for comparison and can be selected in the dropdown menus of windows 2 and 3. You can fly over the scene with the mouse and the WASD/arrow keys.

If the demo runs slowly, a point-cloud-only comparison version is available via the "Open lite" link.
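To make the two render types concrete, the sketch below derives a voxel grid and a point cloud from a point-queryable occupancy function. It is a generic illustration rather than the demo's actual code: `toy_occupancy` is a hypothetical stand-in for the model (flat ground plus a wall), and the threshold, step size, and first-hit rule are assumptions.

```python
import numpy as np

def toy_occupancy(points):
    """Hypothetical stand-in for the model: ground below z = 0, wall near x = 10 m."""
    ground = points[:, 2] < 0.0
    wall = (np.abs(points[:, 0] - 10.0) < 0.5) & (points[:, 2] < 3.0)
    return (ground | wall).astype(float)

def voxelize(occupancy_fn, roi_min, roi_max, voxel_size, threshold=0.5):
    """Query the field at voxel centres and return a boolean grid."""
    axes = [np.arange(lo + voxel_size / 2, hi, voxel_size)
            for lo, hi in zip(roi_min, roi_max)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    centres = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    return (occupancy_fn(centres) > threshold).reshape(xs.shape)

def ray_march(occupancy_fn, origin, directions, max_range, step, threshold=0.5):
    """Return the first sample on each ray whose occupancy exceeds the threshold."""
    directions = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    depths = np.arange(step, max_range + step / 2, step)                 # (S,)
    samples = origin + directions[:, None, :] * depths[None, :, None]   # (R, S, 3)
    occ = occupancy_fn(samples.reshape(-1, 3)).reshape(len(directions), -1) > threshold
    hit = occ.any(axis=1)        # rays that reach any occupied sample
    first = occ.argmax(axis=1)   # index of the first occupied sample per ray
    return samples[np.arange(len(directions)), first][hit], hit

origin = np.array([0.0, 0.0, 1.5])     # assumed sensor position in metres
directions = np.array([[1.0, 0.0, 0.0],    # forward, hits the wall
                       [0.0, 0.0, 1.0],    # straight up, hits nothing
                       [1.0, 0.0, -1.0]])  # downward, hits the ground
cloud, hit = ray_march(toy_occupancy, origin, directions, max_range=40.0, step=0.1)
grid = voxelize(toy_occupancy, (-4, -4, -1), (4, 4, 3), voxel_size=1.0)
```

Rays that never cross the threshold contribute no point, so the resulting cloud only covers surfaces the field marks as occupied within `max_range`.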

Open fullscreen Open lite

Scene consistency with single camera input

This demo shows the occupancy produced by the model (rendered from the top with nerfacc) when all cameras are available versus when only a single camera is available. The scene representation stays consistent within the observable region (unobserved regions are dimmed). You can select the input camera by clicking on a frustum or a camera image.

Open fullscreen

BEV queries attention

This demo visualizes the attention matrix of one of the projectors. It shows how the model internally matches latent BEV query cells to the relevant image regions. Click on the BEV frame to select a query cell, and the image patches will be highlighted according to attention intensity.
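The lookup behind this kind of visualization can be sketched as a row selection on the attention matrix. The sketch assumes a matrix of shape (BEV cells, image tokens) with tokens ordered camera-major and row-major within each image; that ordering, the function name, and the toy sizes are illustrative assumptions, not details taken from the model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_heatmaps_for_cell(attn, cell_rc, bev_hw, n_cams, patch_hw):
    """Return per-camera patch heatmaps (n_cams, ph, pw) for one BEV cell.

    attn: (H*W, n_cams*ph*pw), each row sums to 1 (assumed token ordering:
    camera-major, then row-major patches).
    """
    _, w = bev_hw
    ph, pw = patch_hw
    row = attn[cell_rc[0] * w + cell_rc[1]]
    heat = row.reshape(n_cams, ph, pw)
    return heat / heat.max()  # normalise so the strongest patch is 1

# Toy attention: BEV cell (1, 2) attends mostly to camera 3, patch (0, 1).
bev_hw, n_cams, patch_hw = (4, 4), 6, (2, 3)
logits = np.zeros((16, 36))
logits[1 * 4 + 2, 3 * 6 + 0 * 3 + 1] = 5.0
attn = softmax(logits)

heat = patch_heatmaps_for_cell(attn, (1, 2), bev_hw, n_cams, patch_hw)
cam, r, c = (int(i) for i in np.unravel_index(heat.argmax(), heat.shape))
```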

Open fullscreen

Image regions lookup

This demo shows the learned image-BEV correspondence from the opposite direction. Pick a camera and select image regions (and, correspondingly, patches) to see which BEV queries attend to them, highlighted by intensity. Multiple regions can be selected on different cameras simultaneously.
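The reverse lookup amounts to summing columns of the same attention matrix. As a minimal sketch under the same assumed token ordering (camera-major, row-major patches), with a hypothetical function name and toy values, the attention mass that each BEV query places on the selected patches could be computed as follows.

```python
import numpy as np

def bev_response_for_patches(attn, selected, bev_hw, patch_hw):
    """Return an (H, W) map of attention mass placed on the selected patches.

    attn:     (H*W, n_cams*ph*pw) attention matrix, rows sum to 1
    selected: list of (camera, patch_row, patch_col), possibly across cameras
    """
    ph, pw = patch_hw
    cols = [cam * ph * pw + r * pw + c for cam, r, c in selected]
    return attn[:, cols].sum(axis=1).reshape(bev_hw)

# Toy attention: uniform everywhere except BEV cell (1, 2), which splits its
# attention between camera 3 patch (0, 1) and camera 5 patch (0, 0).
bev_hw, n_cams, patch_hw = (4, 4), 6, (2, 3)
attn = np.full((16, 36), 1.0 / 36)
attn[6] = 0.0
attn[6, 19] = 0.7   # camera 3, patch (0, 1)
attn[6, 30] = 0.3   # camera 5, patch (0, 0)

response = bev_response_for_patches(attn, [(3, 0, 1), (5, 0, 0)], bev_hw, patch_hw)
peak = tuple(int(i) for i in np.unravel_index(response.argmax(), response.shape))
```

Summing over columns is what allows regions on several cameras to be combined into a single BEV highlight.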

Open fullscreen

BibTeX

@article{vigt2026,
  title   = {Visual Implicit Geometry Transformer for Autonomous Driving},
  author  = {Arsenii Shirokov and Mikhail Kuznetsov and Danila Stepochkin and Egor Evdokimov and Daniil Glazkov and Nikolay Patakin and Anton Konushin and Dmitry Senushkin},
  journal = {arXiv preprint arXiv:2602.05573},
  year    = {2026}
}