VGGT: Visual Geometry Grounded Transformer

Summary

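Core idea: a 1.2B-parameter transformer that, in a single forward pass, jointly predicts camera intrinsics and extrinsics, depth maps, point maps, and dense features for point tracking from one to several hundred images. With no geometric post-processing, it outperforms the strongest methods that rely on bundle adjustment or global alignment.

Method: a plain ViT-L-style architecture (24 layers) that uses Alternating-Attention (frame-wise and global self-attention in alternation) instead of a specialized 3D inductive bias. DINOv2 patchifies the images; a camera head outputs poses and a DPT head outputs the dense quantities. All tasks are supervised jointly even though the outputs are over-complete; at inference, the point cloud unprojected from depth + camera is actually more accurate than the point map head's output, so the decomposition helps.

Results: on RealEstate10K, AUC@30 rises from DUSt3R's 67.7 to 85.3, so the feed-forward model beats the best post-processing baseline in Table 1 (VGGSfM v2, 78.9) by about 6 points, at ~0.2s versus ~7s for DUSt3R. Adding BA lifts it further to 93.5. On DTU MVS, in the setting without GT cameras, Chamfer distance drops from DUSt3R's 1.741 to 0.382, close to methods that are given GT cameras.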

Sources: paper | website | github

Rating: 3 - Foundation. 1024 citations in 13 months, 29% influential, 12.9k⭐, CVPR'25 Best Paper; it has already spawned a series of follow-ups such as SwiftVGGT / InfiniteVGGT / HD-VGGT and is the new landmark for feed-forward 3D reconstruction.

Key Takeaways:

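The 3D-as-a-regression-problem paradigm holds: no special inductive bias needs to be designed for 3D; a generic transformer trained on enough 3D-annotated data learns geometry. This is a successful scale-up that carries the NLP/CV foundation-model recipe over to 3D reconstruction.
Over-complete supervision helps: even though camera + depth determine the point map (a closed-form relation), supervising all quantities during training clearly improves the accuracy of each one (Tab. 6). At inference, however, the point cloud obtained by depth + camera unprojection is more accurate than the point map head's direct output (Tab. 3). "Decomposing into subtasks" and "joint supervision" do not conflict.
Alternating-Attention is key: alternating frame-wise and global self-attention beats both global-only self-attention and cross-attention (Tab. 5). This matters for handling an arbitrary number of views while keeping permutation equivariance over the non-reference frames.
Feed-forward and BA remain complementary: VGGT's directly predicted points/depth serve as a good initialization for BA, removing the need for triangulation and iterative refinement, so BA needs only ~2s to push RealEstate10K AUC@30 from 85.3 to 93.5.
Features transfer: with the backbone frozen or fine-tuned, they markedly improve non-rigid point tracking (CoTracker + VGGT features beat the specialized trackers on every TAP-Vid column in Table 8) and feed-forward novel view synthesis (close to LVSM without being given input cameras).

Teaser. VGGT takes up to hundreds of images and, for smaller inputs within about a second, jointly outputs cameras, point maps, depth maps, and point tracks. It uses no 3D-specific inductive bias and is trained on a large amount of 3D-annotated data.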

1. Problem Definition

The input is an arbitrary set of $N$ RGB images $(I_{i})_{i = 1}^{N}$; VGGT's transformer $f$ outputs a set of 3D quantities for each frame:

$$f\big((I_{i})_{i = 1}^{N}\big) = (g_{i}, D_{i}, P_{i}, T_{i})_{i = 1}^{N}$$

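Notation:

$g_{i} \in \mathbb{R}^{9}$ : camera parameters = quaternion (4) + translation (3) + FoV (2) (the principal point is fixed at the image center, following VGGSfM's parametrization)
$D_{i} \in \mathbb{R}^{H \times W}$ : depth map
$P_{i} \in \mathbb{R}^{3 \times H \times W}$ : viewpoint-invariant point map, i.e. all points are expressed in the coordinate frame of the first camera $g_{1}$ (following DUSt3R's convention)
$T_{i} \in \mathbb{R}^{C \times H \times W}$ : dense features for point tracking, consumed by a separate tracking module $T$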
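To make the 9-dimensional camera vector concrete, here is a minimal sketch of decoding it into a rotation matrix, a translation, and pinhole intrinsics. The element order `[q, t, fov]`, the scalar-last quaternion layout `(x, y, z, w)` (chosen to match the identity $q_{1} = [0, 0, 0, 1]$), and the FoV order (vertical, horizontal) are assumptions for illustration; `decode_camera` is a hypothetical helper, not the released API.

```python
import math

def quat_to_rotmat(q):
    """Quaternion in (x, y, z, w) order -> 3x3 rotation matrix as nested lists."""
    x, y, z, w = q
    n = math.sqrt(x * x + y * y + z * z + w * w)  # normalize defensively
    x, y, z, w = x / n, y / n, z / n, w / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def fov_to_intrinsics(fov_h, fov_w, height, width):
    """Field of view in radians -> pinhole K with the principal point at the image center."""
    fy = (height / 2) / math.tan(fov_h / 2)
    fx = (width / 2) / math.tan(fov_w / 2)
    return [[fx, 0.0, width / 2], [0.0, fy, height / 2], [0.0, 0.0, 1.0]]

def decode_camera(g, height, width):
    """Split g = [q(4), t(3), fov(2)] (assumed order) into (R, t, K)."""
    q, t, fov = g[0:4], g[4:7], g[7:9]
    return quat_to_rotmat(q), list(t), fov_to_intrinsics(fov[0], fov[1], height, width)
```

With a 90 degree field of view the focal length equals half the image size, which is a quick sanity check on the FoV convention.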

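Two important conventions:

The first frame is the world frame: $g_{1}$ is fixed to the identity ($q_{1} = [0, 0, 0, 1], t_{1} = [0, 0, 0]$). The architecture is permutation equivariant with respect to all frames except the first.
Over-complete predictions: $g$, $D$, and $P$ are mathematically redundant (camera + depth yields the point map; point map + PnP yields the camera). The paper nevertheless verifies that predicting all of them during training is more accurate than predicting only a subset.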
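The closed-form relation behind this redundancy (camera + depth gives the point map) can be sketched as a plain unprojection. This is an illustrative sketch, not the paper's code: it assumes a pinhole `K`, extrinsics that map world to camera (`x_cam = R x_world + t`), and integer pixel coordinates without a half-pixel offset.

```python
def unproject_depth(depth, K, R, t):
    """depth: H x W nested list. K: 3x3 intrinsics. R, t: world-to-camera (assumed).

    Returns an H x W nested list of (X, Y, Z) tuples in the world frame.
    """
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    points = []
    for v, row in enumerate(depth):
        out_row = []
        for u, d in enumerate(row):
            # Back-project pixel (u, v) with depth d into the camera frame.
            x_cam = [(u - cx) / fx * d, (v - cy) / fy * d, d]
            # Invert x_cam = R x_world + t, i.e. x_world = R^T (x_cam - t).
            diff = [x_cam[k] - t[k] for k in range(3)]
            out_row.append(tuple(sum(R[k][j] * diff[k] for k in range(3)) for j in range(3)))
        points.append(out_row)
    return points
```

For the first frame, `R` is the identity and `t` is zero, so the unprojected points are already in the shared coordinate frame used by the point maps.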

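2. Architecture: Alternating-Attention + Multiple Heads

Figure 2. Architecture Overview. DINOv2 patchifies each image into tokens; one camera token (the first frame and the other frames use different learnable tokens, to mark the reference frame) and 4 register tokens are appended. The backbone has 24 layers of alternating frame-wise / global self-attention. The camera head outputs camera parameters, and the DPT head outputs all dense predictions.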
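A quick worked count of the sequence lengths this token layout implies. The patch size of 14 is an assumption (it is DINOv2's usual patch size and matches the $14 \times 14$ conv mentioned in Section 8); the 1 camera token and 4 register tokens come from the caption above. The helper names are hypothetical.

```python
def tokens_per_frame(height, width, patch=14, camera_tokens=1, register_tokens=4):
    """Patch tokens plus the per-frame camera and register tokens.

    Integer division assumes the image side is a multiple of the patch size.
    """
    return (height // patch) * (width // patch) + camera_tokens + register_tokens

def global_sequence_length(num_frames, height, width):
    """Number of tokens that one global self-attention layer sees."""
    return num_frames * tokens_per_frame(height, width)
```

Under these assumptions a 518×518 frame yields 1374 tokens, a 336×518 frame yields 893, and 200 frames at 336×518 give a global sequence of 178,600 tokens.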

2.1 Alternating-Attention (AA)

The transformer uses no cross-attention, only self-attention, but alternates between two scopes:

Frame-wise self-attention: attention only among the tokens of a single frame, which acts as per-frame activation normalization
Global self-attention: attention across all tokens of all frames, which fuses information across views

The default is $L = 24$ layers, meaning 24 frame-wise plus 24 global attention blocks, 48 in total.
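The two attention scopes differ only in how tokens are grouped before attention is applied. The toy sketch below makes that grouping explicit; it uses single-head attention with identity Q/K/V projections and omits residuals, MLPs, and normalization, so it illustrates the scoping only and is not the actual block.

```python
import math

def self_attention(tokens):
    """Toy single-head softmax self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def frame_attention(frames):
    """Each frame attends only to its own tokens."""
    return [self_attention(frame) for frame in frames]

def global_attention(frames):
    """All tokens of all frames attend to each other, then are split back per frame."""
    flat = [tok for frame in frames for tok in frame]
    mixed = self_attention(flat)
    out, start = [], 0
    for frame in frames:
        out.append(mixed[start:start + len(frame)])
        start += len(frame)
    return out

def alternating_attention(frames, num_layers=2):
    """One layer = one frame-wise block followed by one global block."""
    for _ in range(num_layers):
        frames = global_attention(frame_attention(frames))
    return frames
```

Changing the tokens of frame 2 leaves the frame-wise output of frame 1 untouched but changes its global output, which is exactly the information flow the alternation is meant to provide.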

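❓ Why does cross-attention do worse than self-attention? Tab. 5 shows ETH3D Chamfer (Overall) degrading from 0.709 with AA to 1.061 with cross-attention. Intuitively, cross-attention maximizes cross-frame information flow, but the authors offer only the empirical result. A guess: cross-attention does not let tokens within the same frame see each other, which breaks the consistency of the within-frame representation.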

2.2 Prediction Heads

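Camera head: attached to the camera token; 4 extra self-attention layers plus a linear layer output $\hat{g}_{i}$
DPT head: converts the image tokens into dense feature maps, then a $3 \times 3$ conv produces $D_{i}$, $P_{i}$, and the tracking features $T_{i}$. It also predicts the aleatoric uncertainties $\Sigma_{i}^{D}$ and $\Sigma_{i}^{P}$
Tracking head $T$: uses the CoTracker2 architecture with $T_{i}$ as input. The query point is bilinearly sampled from the query image's feature map, correlated with the features of the other frames, and then passed through self-attention to produce the 2D correspondences. It does not assume a temporal video order, so it suits arbitrary image collections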
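The first two steps of the tracking head (bilinear sampling of the query feature, then correlation against another frame) can be sketched as follows. This is a simplified stand-in: the final `best_match` argmax replaces the self-attention refinement that the CoTracker2-style head actually uses, and all function names are hypothetical.

```python
def bilinear_sample(feat, x, y):
    """feat: H x W x C nested list. (x, y): pixel coordinates, clamped to the grid."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ax) * (1 - ay) * feat[y0][x0][k] + ax * (1 - ay) * feat[y0][x1][k]
        + (1 - ax) * ay * feat[y1][x0][k] + ax * ay * feat[y1][x1][k]
        for k in range(len(feat[0][0]))
    ]

def correlation_map(query_vec, feat):
    """Dot product between the query feature and every pixel feature of another frame."""
    return [[sum(q * f for q, f in zip(query_vec, px)) for px in row] for row in feat]

def best_match(corr):
    """Coarse correspondence: (x, y) of the highest correlation (stand-in for the refiner)."""
    _, x, y = max((val, x, y) for y, row in enumerate(corr) for x, val in enumerate(row))
    return x, y
```

Because nothing here depends on frame order, the same correlation can be computed against any other image in an unordered set.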

3. Training

3.1 Loss

Equation. Multi-task loss

$$L = L_{camera} + L_{depth} + L_{pmap} + \lambda L_{track}$$

$L_{camera}$ : Huber loss on $\hat{g} - g$
$L_{depth}$ : DUSt3R-style aleatoric-uncertainty loss with an additional gradient term (common in monocular depth estimation): $\| \Sigma^{D} \odot (\hat{D} - D) \| + \| \Sigma^{D} \odot (\nabla \hat{D} - \nabla D) \| - \alpha \log \Sigma^{D}$
$L_{pmap}$ : same form as the depth loss, but with $\Sigma^{P}$
$L_{track}$ : CoTracker2-style L1 + visibility BCE, with $\lambda = 0.05$

The camera, depth, and pmap losses have similar magnitudes and need no rebalancing; only the track loss is down-weighted.
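The loss terms above can be written out per pixel as a small sketch. Several choices here are assumptions made for illustration: the norm is taken as a per-pixel absolute value averaged over the image, the gradient is a forward finite difference, `alpha=1.0` and the Huber `delta=1.0` are placeholder values, and the depth version is shown on a single-channel map (the point map version would apply the same form per coordinate).

```python
import math

def huber(residual, delta=1.0):
    """Huber penalty on one scalar residual (delta is a placeholder value)."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def finite_diff(img):
    """Forward differences in x and y, zero at the last column / row."""
    h, w = len(img), len(img[0])
    dx = [[img[y][x + 1] - img[y][x] if x + 1 < w else 0.0 for x in range(w)] for y in range(h)]
    dy = [[img[y + 1][x] - img[y][x] if y + 1 < h else 0.0 for x in range(w)] for y in range(h)]
    return dx, dy

def uncertainty_loss(pred, gt, sigma, alpha=1.0):
    """Mean over pixels of sigma*|pred-gt| + sigma*|grad pred - grad gt| - alpha*log(sigma)."""
    pdx, pdy = finite_diff(pred)
    gdx, gdy = finite_diff(gt)
    total, count = 0.0, 0
    for y in range(len(pred)):
        for x in range(len(pred[0])):
            s = sigma[y][x]
            data = abs(pred[y][x] - gt[y][x])
            grad = abs(pdx[y][x] - gdx[y][x]) + abs(pdy[y][x] - gdy[y][x])
            total += s * data + s * grad - alpha * math.log(s)
            count += 1
    return total / count

def total_loss(l_camera, l_depth, l_pmap, l_track, lam=0.05):
    """Multi-task sum with only the tracking term down-weighted."""
    return l_camera + l_depth + l_pmap + lam * l_track
```

The `- alpha * log(sigma)` term is what stops the network from driving the predicted confidence to zero everywhere: lowering `sigma` shrinks the residual terms but is penalized by the log.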

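3.2 Coordinate Normalization (key detail)

Scale and the global reference frame are not observable from images, so a canonical choice has to be defined. VGGT's approach:

Express all quantities in the coordinate frame of $g_{1}$
Compute the mean Euclidean distance of all 3D points in $P$ to the origin, and use this scale to normalize $t$, $P$, and $D$
Key difference: unlike DUSt3R, the network's predictions are not normalized; the network is instead made to learn to output already-normalized results

In the Discussion, the authors stress that normalizing the predictions is neither necessary for convergence nor beneficial for final performance, and that it tends to add training instability.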
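The normalization steps listed above amount to one scale factor applied to the ground-truth targets. A minimal sketch, assuming the points are already expressed in the first camera's frame and that invalid pixels have been filtered out beforehand (the masking is not shown, and `normalize_scene` is a hypothetical helper):

```python
import math

def normalize_scene(points, translations, depths):
    """Scale ground-truth targets by the mean distance of the 3D points to the origin.

    points: list of (X, Y, Z) in the first camera's frame.
    translations: list of (tx, ty, tz), one per camera.
    depths: flat list of depth values.
    """
    scale = sum(math.sqrt(x * x + y * y + z * z) for x, y, z in points) / len(points)
    norm_points = [(x / scale, y / scale, z / scale) for x, y, z in points]
    norm_trans = [(tx / scale, ty / scale, tz / scale) for tx, ty, tz in translations]
    norm_depths = [d / scale for d in depths]
    return norm_points, norm_trans, norm_depths, scale
```

Only the targets pass through this function; under the scheme described above, the network's raw outputs are compared against the normalized targets directly.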

3.3 Implementation

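Architecture: a ViT-L-scale transformer with 24 AA blocks, hidden size 1024, 16 heads, ~1.2B parameters
Training: AdamW, 160K iterations, peak LR 2e-4, 8K warmup iterations, cosine schedule
Batch sampling: 2 to 24 frames are sampled at random per scene, while the total number of frames per batch stays fixed at 48; images are resized so the longer side is 518, with the aspect ratio randomized in [0.33, 1.0]
Hardware: 64 × A100, 9 days
Stabilization: QKNorm, LayerScale (init 0.01), gradient norm clipping at 1.0, bf16 + gradient checkpointing
The DPT head taps the tokens of blocks 4/11/17/23 for upsampling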
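The learning-rate schedule listed above (8K warmup iterations, cosine decay over 160K iterations, peak 2e-4) can be sketched as a single function. The linear warmup shape and the decay floor of zero are assumptions; the note only gives the peak, the warmup length, and the schedule type.

```python
import math

def learning_rate(step, peak_lr=2e-4, warmup_steps=8_000, total_steps=160_000, min_lr=0.0):
    """Linear warmup (assumed) followed by cosine decay to min_lr (assumed 0)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `learning_rate(8_000)` returns the peak value and `learning_rate(160_000)` returns the floor.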

3.4 Training Data

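A broad mix of 16+ datasets (including Co3Dv2, BlendMVS, DL3DV, MegaDepth, Kubric, WildRGB, ScanNet, HyperSim, Mapillary, Habitat, Replica, MVS-Synth, PointOdyssey, Virtual KITTI, Aria, and Objaverse-like synthetic assets), covering indoor / outdoor / synthetic / real scenes. The 3D annotations come from sensors, engine rendering, and SfM. The overall scale is comparable to MASt3R's.

4. Main Experimental Results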

4.1 Camera Pose (RealEstate10K & CO3Dv2, 10 frames)

Table 1. Camera Pose Estimation on RealEstate10K & CO3Dv2. AUC@30 (higher is better). None of the learned methods were trained on Re10K. Runtime is measured on one H100. $‡$ marks concurrent work.

Method	Re10K AUC@30 ↑	CO3Dv2 AUC@30 ↑	Time
Colmap+SPSG	45.2	25.3	~15s
PixSfM	49.4	30.1	>20s
DUSt3R	67.7	76.7	~7s
MASt3R	76.4	81.8	~9s
VGGSfM v2	78.9	83.4	~10s
MV-DUSt3R $‡$	71.3	69.5	~0.6s
CUT3R $‡$	75.3	82.8	~0.6s
FLARE $‡$	78.8	83.3	~0.5s
Fast3R $‡$	72.7	82.5	~0.2s
VGGT (Feed-Forward)	85.3	88.2	~0.2s
VGGT + BA	93.5	91.8	~1.8s

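Feed-forward VGGT raises the previous best result (VGGSfM v2, which needs ~10s of optimization) from 78.9 to 85.3. Re10K is an unseen dataset for the learned methods, and the margin is larger there than on CO3Dv2 (+6.4 vs. +4.8 points). Adding BA gains about another 8 points, and the BA stage takes only ~2s because VGGT's points/depth serve directly as the initialization, with no triangulation or iterative refinement needed.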
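For readers unfamiliar with the metric, here is a sketch of one common way AUC@30 is computed in this line of work. The definition is an assumption, since the note does not spell it out: for every image pair, take the larger of the relative rotation error and the relative translation-direction error (both in degrees), compute the fraction of pairs under each integer threshold from 1 to 30, and average those fractions.

```python
def pose_auc(rot_errors_deg, trans_errors_deg, max_threshold=30):
    """Area under the accuracy-vs-threshold curve, thresholds 1..max_threshold degrees.

    Assumed definition: a pair counts as correct at a threshold only if both its
    rotation and translation angular errors fall below that threshold.
    """
    worst = [max(r, t) for r, t in zip(rot_errors_deg, trans_errors_deg)]
    accuracies = [
        sum(e < threshold for e in worst) / len(worst)
        for threshold in range(1, max_threshold + 1)
    ]
    return sum(accuracies) / max_threshold
```

Multiplying the result by 100 gives numbers on the same scale as the table. Because the metric takes the worse of the two errors, a method must get both rotation and translation right to score well.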

4.2 Multi-view Depth (DTU)

Table 2. DTU MVS. Chamfer distance, lower is better. Whether the GT camera is known is marked with ✓/✗.

Known GT cam	Method	Acc. ↓	Comp. ↓	Overall ↓
✓	GeoMVSNet	0.331	0.259	0.295
✓	MASt3R	0.403	0.344	0.374
✗	DUSt3R	2.677	0.805	1.741
✗	VGGT	0.389	0.374	0.382

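Without GT cameras, VGGT cuts Chamfer (Overall) from DUSt3R's 1.741 to 0.382, nearly matching the methods that are given GT cameras. The paper attributes this to multi-image training, which teaches the model native multi-view triangulation, whereas DUSt3R has to average pairwise results.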
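The three columns of the table follow the usual Chamfer breakdown, sketched below with a brute-force nearest-neighbour search. This is an illustration of the metric's structure only; the benchmark's own evaluation protocol (alignment, outlier thresholds, point sampling) is not reproduced here.

```python
import math

def mean_nn_distance(src, dst):
    """Average distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def chamfer_metrics(pred, gt):
    """Accuracy (pred -> gt), Completeness (gt -> pred), and Overall (their mean)."""
    acc = mean_nn_distance(pred, gt)
    comp = mean_nn_distance(gt, pred)
    return acc, comp, 0.5 * (acc + comp)
```

The Overall column in Table 2 is consistent with this averaging: for DUSt3R, (2.677 + 0.805) / 2 = 1.741, and for VGGT, (0.389 + 0.374) / 2 ≈ 0.382.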

4.3 Point Map (ETH3D)

Table 3. Point Map Estimation on ETH3D. DUSt3R/MASt3R use global alignment; VGGT is purely feed-forward. "Ours (Depth + Cam)" means the point cloud is generated by unprojecting the depth head's output with the camera head's output.

Method	Acc. ↓	Comp. ↓	Overall ↓	Time
DUSt3R	1.167	0.842	1.005	~7s
MASt3R	0.968	0.684	0.826	~9s
Ours (Point head)	0.901	0.518	0.709	~0.2s
Ours (Depth + Cam)	0.873	0.482	0.677	~0.2s

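Key observation: at inference, the point cloud composed from depth + camera is more accurate than the one output directly by the point map head. This does not conflict with supervising everything during training. The paper's reading: joint supervision helps feature learning, while decomposing at inference spares the point map head from having to model both the camera and the depth problem at once.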

4.4 Two-View Matching (ScanNet-1500)

Method	AUC@5 ↑	AUC@10 ↑	AUC@20 ↑
SuperGlue	16.2	33.8	51.8
LoFTR	22.1	40.8	57.6
Roma	31.8	53.4	70.9
VGGT	33.9	55.2	73.4

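Even without being trained specifically for two-view matching, VGGT's tracking head outperforms the specialized Roma. The authors attribute this to the generality of the learned features.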

5. Ablations

5.1 Attention 架构

Table 5. Backbone Ablation on ETH3D. The three architectures are kept at the same parameter count ($2L$ attention layers).

Attention	Acc. ↓	Comp. ↓	Overall ↓
Cross-Attention	1.287	0.835	1.061
Global Self-Attention only	1.032	0.621	0.827
Alternating-Attention	0.901	0.518	0.709

Cross-attention is worst (Overall about 50% higher than AA: 1.061 vs. 0.709); global-only self-attention comes next; AA is best.

5.2 Multi-task Learning

Table 6. Task Ablation on ETH3D Point Map.

camera	depth	track	Acc. ↓	Comp. ↓	Overall ↓
✗	✓	✓	1.042	0.627	0.834
✓	✗	✓	0.920	0.534	0.727
✓	✓	✗	0.976	0.603	0.790
✓	✓	✓	0.901	0.518	0.709

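Removing the camera loss hurts most (Overall 0.834), removing the track loss comes next (0.790), and removing the depth loss hurts least (0.727), all compared with 0.709 under full supervision. A coarse, scene-level signal like camera pose appears to regularize the point map most strongly.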

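6. Downstream Applications (frozen / fine-tuned)

VGGT is already a good backbone; the authors show two transfer cases.

6.1 Novel View Synthesis (GSO)

Figure 6. Qualitative NVS results. Top: input; middle: GT target view; bottom: VGGT-NVS prediction. Only the input images plus Plücker rays for the target views are fed into the AA transformer as tokens; the input cameras are not provided.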

Table 7. GSO view synthesis.

Method	Known Input Cam	Size	PSNR ↑	SSIM ↑	LPIPS ↓
LGM	✓	256	21.44	0.832	0.122
GS-LRM	✓	256	29.59	0.944	0.051
LVSM	✓	256	31.71	0.957	0.027
Ours-NVS (20% training data)	✗	224	30.41	0.949	0.033

Without input cameras and with only 20% of the training data, it comes close to LVSM (which is given the input cameras), which is direct evidence of feature quality.
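The target views in this NVS setup are specified through Plücker rays. As a reminder of what that encoding is, here is a minimal sketch of the 6-D Plücker coordinates of a single ray; the `(direction, moment)` ordering is an assumption, and how the rays are tokenized for the transformer is not covered.

```python
import math

def cross(a, b):
    """3-D cross product."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def plucker_ray(origin, direction):
    """Return (d, o x d): the unit direction followed by the moment of the ray."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    return d + cross(origin, d)
```

The moment `o x d` is unchanged when the origin slides along the ray, so the encoding identifies the ray itself rather than a particular point on it.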

6.2 Dynamic Point Tracking (TAP-Vid)

Figure 5. Rigid + Dynamic Tracking visualization. Top: VGGT's native tracking head outputs tracks for unordered images of a static scene. Bottom: the fine-tuned VGGT backbone replaces CoTracker's feature extractor for dynamic video.

Table 8. TAP-Vid Benchmark.

Method	Kinetics AJ	Kinetics δ	Kinetics OA	RGB-S AJ	RGB-S δ	RGB-S OA	DAVIS AJ	DAVIS δ	DAVIS OA
TAPTR	49.0	64.4	85.2	60.8	76.2	87.0	63.0	76.1	91.1
LocoTrack	52.9	66.8	85.3	69.7	83.2	89.5	62.9	75.3	87.2
BootsTAPIR	54.6	68.4	86.5	70.8	83.0	89.9	61.4	73.6	88.7
CoTracker	49.6	64.3	83.3	67.4	78.9	85.2	61.8	76.1	88.3
CoTracker + VGGT	57.2	69.0	88.9	72.1	84.0	91.6	64.7	77.5	91.4

Replacing CoTracker's feature backbone with VGGT's and then fine-tuning raises Kinetics AJ from 49.6 to 57.2 (+7.6 points). Although VGGT natively handles only unordered static scenes, its features are also strong on dynamic video.

7. Runtime and Memory

Table 9. Runtime / Memory vs. Input Frames (H100, FlashAttention v3, image size 336×518).

Frames	1	2	4	8	10	20	50	100	200
Time (s)	0.04	0.05	0.07	0.11	0.14	0.31	1.04	3.12	8.75
Mem (GB)	1.88	2.07	2.45	3.23	3.63	5.58	11.41	21.15	40.63

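The camera head accounts for only ~5% of runtime and ~2% of memory; the DPT head costs about 0.03s and 0.2GB per frame and can be run frame by frame when memory is tight. The cost of global self-attention grows quadratically with the number of frames, which shows up in Table 9 as superlinear runtime (3.12s at 100 frames, 8.75s at 200), while the reported memory grows roughly linearly at about 0.19 GB per frame. At 200 frames (~40GB) a 40GB GPU is already at its limit; the concurrent work Fast3R mitigates this with Tensor Parallelism, and VGGT could adopt the same approach.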
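The scaling behaviour can be read directly off Table 9 with a little arithmetic. The sketch below computes the marginal cost per added frame between consecutive table columns; the numbers are copied from the table, and `per_frame_slope` is a hypothetical helper.

```python
frames = [1, 2, 4, 8, 10, 20, 50, 100, 200]
time_s = [0.04, 0.05, 0.07, 0.11, 0.14, 0.31, 1.04, 3.12, 8.75]
mem_gb = [1.88, 2.07, 2.45, 3.23, 3.63, 5.58, 11.41, 21.15, 40.63]

def per_frame_slope(xs, ys):
    """Marginal cost per extra frame between consecutive table columns."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]

mem_slopes = per_frame_slope(frames, mem_gb)    # GB per added frame
time_slopes = per_frame_slope(frames, time_s)   # seconds per added frame
doubling_ratio = time_s[-1] / time_s[-2]        # 100 -> 200 frames
```

The memory slope stays between roughly 0.19 and 0.20 GB per frame across the whole table, while the time slope rises from about 0.01s per frame at the low end to about 0.056s per frame between 100 and 200 frames, and doubling the frame count from 100 to 200 multiplies runtime by about 2.8.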

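8. Other Notable Discussion

Single-view reconstruction: VGGT natively supports single-image input (global attention degenerates to frame-wise attention), and although it was not trained specifically for this, it works surprisingly well. Unlike DUSt3R/MASt3R, it does not need the image duplicated into a pair.
Differentiable BA: small-scale experiments were promising, but Theseus in PyTorch makes each step about 4× slower, which is too costly for large-scale training, so it was shelved for now. The authors see it as a potential source of self-supervision for scenes without GT 3D.
Patchify: DINOv2 is clearly better than a $14 \times 14$ conv trained from scratch, in both stability and hyperparameter sensitivity; the pretrained visual backbone is key.
Limitations: fisheye / panorama images are not supported; performance drops under extreme rotations; large non-rigid deformations cause failures. The authors stress, however, that all of these can be addressed by targeted fine-tuning, which is far easier than redesigning test-time optimization as methods like DUSt3R would require.

❓ 1.2B parameters and 9 days × 64 A100 is mid-scale for a "large transformer". Given that 3D reconstruction has a higher feature density (pixel-level dense prediction), would further scale-up still pay off noticeably? The paper has no scaling-law analysis and leaves this to follow-up work.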

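Related Work

Builds on

DUSt3R (Wang et al., CVPR 2024): the concept of the viewpoint-invariant point map, the first-camera coordinate frame, and the uncertainty loss are all inherited from it. VGGT's biggest improvements: support for N > 2 frames and no global-alignment post-processing.
MASt3R: DUSt3R's successor, which adds feature matching; VGGT's data scale is comparable to MASt3R's, but the architectural strategy is completely different.
VGGSfM (Wang et al., CVPR 2024): earlier work by the same authors, an end-to-end differentiable SfM; the camera parametrization ( $q, t, f$ ) comes directly from VGGSfM. VGGSfM still relies on differentiable BA, which VGGT drops entirely.
CoTracker2 (Karaev et al.): the tracking module uses its architecture directly.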

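Compared against

DUSt3R / MASt3R: the only two other methods that also need no GT cameras, but they require pairwise processing plus global alignment; VGGT surpasses them on all relevant benchmarks.
VGGSfM v2: the strongest method on phototourism / IMC at the time; feed-forward VGGT comes close (AUC@10 71.3 vs. 76.8) and overtakes it once BA is added (84.9 vs. 76.8), while being 10× faster.
Fast3R / CUT3R / FLARE / MV-DUSt3R (concurrent work, all trying to remove DUSt3R's optimization): VGGT performs best, with speed on par with the fastest, Fast3R.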

Method-related

DINOv2: patchify and positional embedding; the foundation of the backbone.
DPT head (Ranftl et al.): the standard head for dense prediction.
Register tokens (Darcet et al.): 4 register tokens per frame, stabilizing global attention.
QKNorm / LayerScale: standard tricks for training stability.
CoTracker / LocoTrack / TAPTR / BootsTAPIR: the main tracking baselines.
LVSM: the NVS reference point; it outputs the target image directly rather than a 3D representation.
LRM / GS-LRM: comparable feed-forward 3D generation work, but each handles only a single task.
Flash Attention v3: the measured speeds depend on it.

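Paper Review

Strengths

Right problem framing: it directly challenges the convention that "3D requires optimization" and provides strong empirical evidence that feed-forward is enough. The authors overturn their own VGGSfM, which shows rare taste.
Simple architecture: no 3D-specific tricks; AA is the only non-vanilla design choice, and it is backed by an ablation. Simplicity goes with generality: the same backbone brings gains when transferred to NVS and dynamic tracking.
The over-complete supervision vs. decomposition-at-inference observation: supervise jointly at training time, then replace the point map with depth + camera unprojection at inference. Few people articulate this detail clearly, and it is a genuinely valuable insight.
Comprehensive experimental coverage: camera pose / MVS depth / point map / matching / tracking / NVS, each compared against strong baselines, with generalization tested on zero-shot benchmarks such as Re10K.
Mature engineering: Flash Attention v3, bf16, gradient checkpointing, QKNorm, LayerScale, DPT multi-layer taps. Each is current best practice and reflects a deep understanding of large-transformer training.
Open code and models: facebookresearch/vggt has 12.9k⭐, and the impact on the community is plain to see.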

Weaknesses

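No scaling law: how was 1.2B chosen? What do the curves look like for smaller / larger models? A training budget of 9 days × 64 A100 is out of reach for most academic groups, yet there is no guidance on the compute vs. quality trade-off.
200 frames is a hard ceiling: the quadratic complexity of global self-attention saturates a 40GB GPU at 200 frames. This limits the practical range of "hundreds of views"; the "hundreds" in the paper's abstract slightly over-promises. The authors acknowledge that tensor parallelism would help but did not implement it.
Acknowledged failure on non-rigid scenes: large deformations are a hard limitation. The strong dynamic tracking results come from fine-tuning CoTracker rather than from VGGT's native capability, which risks being misleading.
The cost of first-frame asymmetry: treating the first frame as the reference frame introduces permutation variance, and there is no ablation on how the choice of first frame affects the results. For real applications (e.g. video streaming) this is a deployment issue.
No systematic analysis of failure cases: the paper says performance drops under "extreme rotation" but gives no specific rotation-angle threshold or texture conditions under which it breaks. This is not very actionable for downstream users.
The relationship with differentiable BA is unresolved: the authors say BA brings significant gains, but BA-in-training was abandoned because it is 4× slower. That may be where the real ceiling lies; the paper gives no evidence on the upper bound of pure feed-forward.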

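Credibility Assessment

Artifact availability

Code: both inference and training scripts are open-sourced (facebookresearch/vggt)
Model weights: released (VGGT-1B is on HuggingFace; the repo has the corresponding download link)
Training details: fairly complete. Hyperparameters (LR 2e-4, 8K warmup, batch strategy, attention configuration), the dataset list, hardware (64 A100 × 9 days), and stabilization tricks are all disclosed; however, the data mix (per-dataset weights) is only described as "approximately similar", with no exact sampling ratios
Datasets: most of the training data is public (Co3Dv2, BlendMVS, ScanNet, MegaDepth, etc.), but the "synthetic dataset of artist-created assets similar to Objaverse" is internal and not released; NVS also used internal data of "~20% the size of Objaverse"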

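Claim verifiability

✅ "reconstruct in under one second": Tab. 9 gives detailed runtimes that can be checked directly, and the open-source code makes them reproducible.
✅ "SOTA on camera pose / MVS / point map / matching": all benchmarks are standard public datasets, the cost of reproduction is low, and several independent reproductions have confirmed the results.
✅ "simple feed-forward beats optimization-based methods": Tab. 1/2/3 all include comparisons against optimization methods taking 5s or more, and the gaps are substantial rather than marginal.
⚠️ "one, a few, or hundreds of views": 200 frames needs 40GB, so the practical ceiling for "hundreds" is about 200 to 300; scenes at the "thousands" level are not supported (the authors do not claim this, but it is easy to misread in a marketing context).
⚠️ "significantly enhances downstream tasks": NVS uses an internal dataset 20% the size of Objaverse, so it is not strictly comparable to LVSM; still, the quantitative gap (PSNR 30.4 vs. 31.7 for LVSM) is small enough to support the point.
⚠️ "no 3D-inductive biases except AA": strictly speaking, the over-complete prediction targets are themselves a bias; the camera parametrization $(q, t, f)$, first-frame-as-world, and the viewpoint invariance of the point map all encode 3D knowledge into the loss/output space, just not at the architecture level. The authors' phrasing is correct in the strict sense but easy to misread.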

Notes

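VGGT's real contribution is the framing that "3D reconstruction is just another sequence-to-sequence problem". It finally frees the 3D community from the prior that "optimization is a must", much as CLIP freed the vision community from the prior that "a good proxy task is a must".
As a backbone, VGGT's gains on downstream tasks (NVS, dynamic tracking) are an early signal; follow-up work will most likely use VGGT features in more 3D-grounded tasks: scene graphs, affordance prediction, spatial grounding for VLA.
The connection to VLA / spatial reasoning is worth exploring: a robot policy needs joint grounding of camera + depth + tracking, which VGGT provides natively, and feed-forward is fast enough (< 1s per step) to serve as a standard preprocessor for policy conditioning.
AA is related in spirit to "alternating attention" / "bidirectional context injection" in VLMs (the cross-attention vs. self-attention discussion around Perceiver and Flamingo), but VGGT's ablation finds that cross-attention does worse than alternating self-attention. This conclusion may also be instructive for the VLM community.
HF upvotes are low at only 37, but the citation count, GitHub stars, and number of follow-up works leave no doubt about the impact; 3D reconstruction may simply not be core to the HF papers audience.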

Rating

Metrics (as of 2026-04-23): citation=1024, influential=300 (29.3%), velocity=77.0/mo; HF upvotes=37; github 12939⭐ / forks=1428 / 90d commits=0 / pushed 51d ago

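Score: 3 - Foundation

Rationale: 1024 citations in just over a year, a 29% influential citation ratio (well above the typical 10%, indicating substantive uptake rather than courtesy citations), CVPR'25 Best Paper, 12.9k⭐, and a series of follow-ups such as SwiftVGGT / InfiniteVGGT / HD-VGGT. Taken together, these signals make VGGT a landmark of feed-forward 3D reconstruction. Relative to Frontier (2): VGGT is not just "the next SOTA baseline"; it opened a new paradigm and is widely used as a feature backbone, meeting the Foundation standard of "must-read, must-cite for the direction". Relative to "historical landmark" Foundations (such as ImageNet or DROID), VGGT has been out for only a short time, but its influential/total ratio already shows substantive uptake, and a downgrade within the next 3 years is very unlikely.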
