An Embodied Generalist Agent in 3D World

Summary

An Embodied Generalist Agent in 3D World

核心: 首个能在 3D 世界中感知、推理、规划和行动的 embodied 多模态 generalist agent

方法: Object-centric 3D 表示 + LLM，两阶段训练（VL alignment + VLA instruction tuning），LLM 辅助生成高质量 3D-language 数据

结果: 在 3D captioning、QA、embodied reasoning、navigation、manipulation 等任务上超越 task-specific 模型

Sources: paper | website | github

Rating: 2 - Frontier（3D embodied generalist 方向的代表性早期工作与常见 baseline，object-centric 3D + LLM 范式有影响力，但尚未成为 de facto standard）

Key Takeaways:

Object-centric 3D 表示是连接 3D 和 LLM 的有效方案: 相比 3D-LLM 的复杂 feature aggregation，object-centric point cloud + Spatial Transformer 更简洁且效果更好
两阶段训练（alignment → instruction tuning）对 3D VL 理解至关重要: Ablation 显示 alignment stage 对 Scan2Cap 等细粒度理解任务提升显著
Generalist 训练优于 specialist: 多场景多任务的 instruction tuning 不仅不损害单任务性能，还带来跨场景零样本泛化能力
VL 和 VLA 存在 tension: 加入 embodied acting 任务反而会降低 VL 性能，VL 和 action 的联合学习仍是开放问题

Teaser. LEO 系统概览：接收 2D 图像、3D 点云和文本，统一为 autoregressive sequence prediction 完成多模态多任务

Introduction

构建能在 3D 世界中理解和交互的 generalist agent 面临三大挑战：(1) 缺乏合适的 3D 数据集；(2) 缺乏统一模型架构；(3) 缺乏有效学习策略。现有 generalist 模型（Gato、PaLM-E、RT-2）主要在 2D domain 中工作，缺少对 3D 物理环境的理解能力。LEO 提出用 object-centric 3D representation + LLM 构建统一框架，同时处理 perceiving、grounding、reasoning、planning 和 acting。

Model

LEO 将所有模态数据转换为 token 序列，统一送入 decoder-only LLM 处理：

Equation 1. Unified token sequence

system message You are... 2D image tokens (optional) s_{2D}^{(1)}, ..., s_{2D}^{(M)} object-centric 3D tokens s_{3D}^{(1)}, ..., s_{3D}^{(N)} instruction USER:... ASSISTANT: response s_{res}^{(1)}, ... s_{res}^{(T)}

符号说明: $s_{2D}$ 为 ego-centric 2D image tokens， $s_{3D}$ 为 object-centric 3D tokens， $s_{res}$ 为 response tokens 含义: 所有任务（VL 理解、embodied acting）统一为给定 prefix 的 autoregressive language modeling

Video. Scene representation

Video. Model pipeline

Tokenization

Text: SentencePiece tokenizer（32k subwords）
2D: Ego-centric 2D images → OpenCLIP ConvNext 编码
3D: Scene point cloud → Mask3D object proposals → 各 object point cloud → 3D encoder（PointNet++）→ Spatial Transformer → object-centric 3D tokens
Action: 连续动作离散化，离散动作映射到 SentencePiece 中最低频的 reserved tokens

Token Embedding & LLM

Text & 2D: Embedding look-up table + OpenCLIP ConvNext encoder，MLP adapter 对齐维度
Object-centric 3D: PointNet++ → Spatial Transformer（用相对位置和尺寸偏置 attention score，捕捉物体间 3D 空间关系）
LLM: Vicuna-7B + LoRA（冻结 LLM 主体，仅调优 LoRA 参数）

Training & Inference

Equation 2. Prefix language modeling loss

L (θ, B) = - b = 1 \sum ∣ B ∣ t = 1 \sum T lo g p_{θ} (s_{res}^{(b, t)} ∣ s_{res}^{(b, < t)}, s_{prefix}^{(b)})

符号说明: $s_{prefix}$ 为 system message 到 instruction 的所有 prefix tokens， $s_{res}$ 为 response tokens 含义: 标准 prefix language modeling，仅对 response 部分计算 loss

训练时冻结 3D point cloud encoder 和 LLM，微调 2D image encoder、Spatial Transformer 和 LoRA 参数。总参数量 ~7B，可训练参数 ~142M。推理使用 beam search 生成文本，action 通过映射还原。

Datasets

LEO 采用两阶段训练数据：LEO-align（3D VL alignment）和 LEO-instruct（3D VLA instruction tuning）。

Table 1. 数据统计

Dataset	Task	2D required?	3D assets	data
LEO-align	object captioning	✗	Objaverse	660k
LEO-align	object referring	✗	ScanNet + 3RScan	354k
LEO-align	scene captioning	✗	3RScan	20k
LEO-instruct	3D captioning	✗	ScanNet	37k
LEO-instruct	3D QA	✗	ScanNet + 3RScan	83k
LEO-instruct	3D dialogue	✗	3RScan	11k
LEO-instruct	task planning	✗	3RScan	14k
LEO-instruct	navigation	✓	MP3D	60k
LEO-instruct	manipulation	✓	CLIPort	300k

Video. Data pipeline overview

LEO-align: 3D Vision-Language Alignment

三类 3D captioning 数据用于 3D-language alignment：

Object-level captions: 单个 3D 物体与描述的对齐（Cap3D / Objaverse）
Object-in-the-scene captions: 场景中物体的 referring expression（ScanNet + 3RScan）
Scene-level captions: 整个 3D 场景的自然语言描述（3RScan）

LEO-instruct: Instruction Following in 3D world

指令微调数据覆盖六类任务：

3D captioning & QA: 给定 3D 场景输入，生成描述或回答问题
3D dialogue & planning: 对复杂多轮指令生成灵活连贯的回复和任务计划
Navigation & manipulation: 在 3D 环境中完成 embodied acting（ObjNav on MP3D + CLIPort manipulation）

LLM-assisted 3D-language Data Generation

核心流程：3D scene graph → LLM prompting → O-CoT → refinement

Scene-graph-based prompting: 用 3DSSG 提供场景上下文，包含 object attributes 和 spatial relations，比 object boxes 信息更丰富
Object-centric CoT (O-CoT): 要求 LLM 在生成时显式给出 object label 和 ID 作为 thought，减少幻觉
Refinement: 基于 scene graph 的 human-defined filters，自动检测并修正逻辑错误（counting / existence / non-existence）

Table 2. LLM 生成数据质量评估

	Counting	Existence	Non-existence
3D-LLM	56.5	96.8	40.0
Ours	57.4	91.3	27.4
+ O-CoT	78.0	93.4	30.5
+ refinement	100.0	100.0	100.0

Insights: O-CoT 对 counting 准确率提升最大（+20.6%），refinement 进一步将所有类别提升到 100%，验证了 scene graph 作为 ground truth 校验源的有效性。

Capabilities and Analyses

3D Vision-Language Understanding and Reasoning

在 Scan2Cap、ScanQA、SQA3D 三个 benchmark 上评估，使用 Mask3D object proposals（非 GT segments）。

Table 4. 3D VL 理解与 embodied reasoning 定量比较（部分）

	Scan2Cap C	ScanQA C	ScanQA EM@1	SQA3D EM@1
Scan2Cap (specialist)	35.2	-	-	-
Vote2Cap-DETR	61.8	-	-	-
3D-VisTA	66.9	69.6	22.4	48.5
3D-LLM (FlanT5)	-	69.4	20.5	-
LEO	72.4	101.4	24.5	50.0

Insights: LEO 作为 generalist 模型（无 task-specific fine-tuning）在所有任务上超越了 task-specific 模型和先前的 generalist（3D-VisTA、3D-LLM），验证了 object-centric 3D + LLM 架构的有效性。

Scene-grounded Dialogue and Planning

定性研究显示 LEO 能生成精确 grounded 到 3D 场景的回复：任务计划涉及具体 objects 和 plausible actions，描述中包含丰富的空间关系信息。

Embodied Action in 3D World

Table 5. Robot manipulation 结果

	separating-piles seen/unseen	packing-google-objects-seq seen/unseen	put-blocks-in-bowls seen/unseen
CLIPort (single)	98.0 / 75.2	96.2 / 71.9	100 / 25.0
CLIPort (multi)	89.0 / 62.8	84.4 / 70.3	100 / 45.8
LEO	98.8 / 75.2	76.6 / 79.8	86.2 / 35.2

Table 6. Object navigation 结果

	MP3D-val Success/SPL	HM3D-val Success/SPL
Habitat-web (demo)	35.4 / 10.2	-
ZSON	15.3 / 4.8	25.5 / 12.6
LEO	23.1 / 15.2	23.1† / 19.1†

Insights: LEO 在 manipulation unseen tasks 上展现了语义级泛化能力（如 packing-google-objects unseen 超越所有 baseline）。Navigation 方面 SPL 优于 baseline 说明 object-centric 3D input 提供了粗粒度 global map，帮助缩短路径。†标注的 HM3D 是零样本迁移结果。

More Insights into LEO

Table 7. 不同数据配置的 ablation

	Scan2Cap	ScanQA	SQA3D	3RQA	3RDialog	3RPlan
w/o Align	62.8	22.7	50.9	49.7	73.0	80.3
ScanNet only	64.0	24.4	46.8	35.8	25.5	23.4
w/o Act	65.4	24.3	50.0	51.9	73.3	81.1
VLA	65.3	25.0	46.2	51.3	72.3	77.2

Insights:

Alignment: 去掉 alignment stage 在 Scan2Cap 上掉 2.6 分（62.8 vs 65.4），fine-grained captioning 依赖 VL alignment
Generalist > Specialist: ScanNet-only 在跨场景（3RQA 35.8 vs 51.9）和跨任务（3RDialog 25.5 vs 73.3）上大幅下降
VLA tension: VLA 在 SQA3D（46.2 vs 50.0）上明显低于 w/o Act，embodied acting 任务与 VL 存在学习冲突
Data balancing: 不平衡数据导致 hallucination（object-existence 几乎全回答 Yes），augmentation 负样本后 overall 准确率从 34% 提升到 85%

Scaling Law Analysis

测试了 OPT-1.3B → Vicuna-7B → Vicuna-13B 三个 scale 的 LLM，以及有无 alignment 的影响。

关键发现：

LEO 的 instruction tuning loss 随数据量增长呈 log-linear 下降，符合 scaling law
7B → 13B 的提升幅度明显小于 1.3B → 7B，暗示在当前数据规模下 LLM 可能接近 saturation，需要更多数据匹配模型容量
Alignment 在所有数据 scale 上都带来一致的 loss 降低

关联工作

基于

Vicuna-7B: LLM backbone
PointNet++: 3D point cloud encoder
Mask3D: 3D instance segmentation，提供 object proposals
Spatial Transformer: 来自 3D-VisTA，用相对位置/尺寸偏置 attention
LoRA: LLM 高效微调
OpenCLIP ConvNext: 2D ego-centric image encoder

对比

3D-VisTA: Task-specific fine-tuned generalist，LEO 在所有任务上超越
3D-LLM: 使用 multi-view images 的 3D VLM，LEO 用更简洁的 object-centric 方法取得更好效果
CLIPort: Manipulation baseline，LEO 在部分 unseen tasks 上更优
Gato / PaLM-E / RT-2: VLA generalist 前辈，但仅处理 2D 输入

方法相关

Cap3D: Object-level captioning 数据来源
3DSSG: Scene graph 数据用于 LLM-assisted data generation
ScanNet / 3RScan: 核心 3D scene 数据集

论文点评

Strengths

首个统一 3D VL 和 embodied acting 的 generalist 框架: 在同一个模型中完成 captioning、QA、dialogue、planning、navigation、manipulation 六大类任务，且不需要 task-specific head
Object-centric 3D 表示设计简洁有效: 用 Mask3D proposals + PointNet++ + Spatial Transformer 建立 object-centric token，比 3D-LLM 的 multi-view feature aggregation 更简洁，效果更好
LLM-assisted data pipeline 设计严谨: Scene graph prompting + O-CoT + refinement 三层质量控制，counting 准确率从 57.4% → 100%，方法有技术含量
全面的 ablation 和 scaling 分析: 对 alignment、generalist vs specialist、VL vs VLA 等关键设计决策都做了 controlled experiments，提供了有价值的 insights

Weaknesses

3D scene 依赖 GT/Mask3D proposals: Object-centric 表示预设了良好的 instance segmentation，在 open-world 或 noisy perception 下是否仍然有效存疑
VL 和 VLA 的 tension 未解决: 加入 acting 任务会损害 VL 性能（SQA3D 从 50.0 降到 46.2），作者指出但未提出解决方案
Navigation 只用 truncated past actions: 不用 recurrent module 限制了长时序导航能力，success rate 低于 Habitat-web baseline
3D 数据规模仍然有限: LEO-align 约 1M、LEO-instruct 约 500K 条数据，相比 2D VLM 数据规模差距巨大，scaling law 分析也显示 data 是瓶颈

可信评估

Artifact 可获取性

代码: inference+training 均已开源（GitHub repo 带完整 configs/ 与训练脚本）
模型权重: align.pth（alignment stage checkpoint）、sft_noact.pth（instruction tuning without acting tasks checkpoint），发布在 HuggingFace
训练细节: 超参 + 数据配比 + 训练步数完整（configs/ 目录包含详细配置）
数据集: 开源（LEO_data on HuggingFace，含 scan data + language annotations + EAI data examples）

Claim 可验证性

✅ 3D VL 任务 SOTA：Scan2Cap / ScanQA / SQA3D 均有标准 benchmark 评估，数据和代码开源
✅ Alignment stage 重要性：ablation 有 controlled experiment（w/o Align vs w/o Act）
✅ Generalist > Specialist：ScanNet-only vs full 数据的 ablation 直接可验证
⚠️ “首个 embodied generalist agent in 3D world”：定义上 PaLM-E 等也处理 3D 信息，但 LEO 强调的是 explicit 3D point cloud input 而非 2D → 3D
⚠️ Navigation 泛化能力：HM3D 零样本结果仅与 ZSON 比较，缺少更多 baseline

Notes

Rating

Metrics (as of 2026-04-24): citation=357, influential=36 (10.1%), velocity=12.23/mo; HF upvotes=8; github 483⭐ / forks=41 / 90d commits=0 / pushed 368d ago · stale

分数：2 - Frontier 理由：LEO 是 3D embodied generalist 方向（VLA × 3D scene understanding）的代表性早期工作，object-centric 3D + LLM 的范式被后续 3D-LLM / 3D VQA / embodied reasoning 工作广泛引用和比较（ICML 2024，社区关注度高、开源完整），符合 “Frontier / SOTA / 必要 baseline” 的 2 档定位。之所以不是 3：该范式并未像 CLIP / SAM / ScanNet 那样成为 3D embodied 领域不可绕开的 building block——object-centric 依赖 Mask3D proposals 的前提在 open-world 下受限，VLA tension 也未解决；不是 1：在 3D VL + embodied action 统一框架这条线上，至今仍是常见比较对象，并未被取代。

MindFlow

Explorer

An Embodied Generalist Agent in 3D World

Summary

Introduction

Model

Tokenization

Token Embedding & LLM

Training & Inference

Datasets

LEO-align: 3D Vision-Language Alignment

LEO-instruct: Instruction Following in 3D world

LLM-assisted 3D-language Data Generation

Capabilities and Analyses

3D Vision-Language Understanding and Reasoning

Scene-grounded Dialogue and Planning

Embodied Action in 3D World

More Insights into LEO

Scaling Law Analysis

关联工作

基于

对比

方法相关

论文点评

Strengths

Weaknesses

可信评估

Artifact 可获取性

Claim 可验证性

Notes

Rating

Table of Contents