LAPA: Latent Action Pretraining from Videos

Summary


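Core idea: the first method that pretrains a VLA without ground-truth robot action labels. It learns VQ-VAE-style discrete latent actions from pairs of video frames, then uses a small amount of labeled data to map the latents to the real action space

Method: two pretraining stages plus one finetuning stage: (1) VQ-VAE-style latent action quantization (encode $x_{t}, x_{t + H}$ into discrete tokens $z_{t}$, decode to reconstruct $x_{t + H}$); (2) latent pretraining, in which a VLM (LWM-Chat-1M 7B) predicts $z_{t}$; (3) action finetuning on a small action-labeled dataset, replacing the latent action head

Results: LAPA pretrained on Open-X beats OpenVLA by +6.22% average success rate on real-robot tasks (absolute, 50.1 vs 43.9) and is roughly 30-40x more efficient to pretrain; pretraining on human videos alone (Something-Something V2) also transfers positively to the real robot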

Sources: paper | website | github

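Rating: 2 - Frontier (opened up the paradigm of pretraining VLAs from action-free video and was picked up by follow-up work such as DreamGen / UWM / GR00T, but the latent action paradigm itself is still evolving and has not settled into a standard recipe)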

Key Takeaways:

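Latent action as "action BPE": by analogy with BPE in language modeling, the discrete latent tokens that the VQ-VAE learns between frame pairs act as an embodiment-agnostic vocabulary of atomic actions, with no built-in prior such as an end-effector or joint parameterization
Shared latent space across embodiments: on multi-embodiment Open-X data, the same latent action corresponds to a similar motion direction in the reconstructed images of different robots. The notes treat this as the key reason LAPA can beat OpenVLA, since ground-truth action pretraining overfits to the action space of the source embodiment
Human video → robot transfer works: pretraining only on Something-Something V2 (human hand manipulation videos) still matches or exceeds OpenVLA (Bridge) on average downstream real-robot success, which supports web-scale video as a viable source of robotics pretraining data
Grasping is the weak spot: LAPA beats OpenVLA on reaching (83.3 vs 66.7) but trails on pick-and-place, with a high rate of early grasp failures; latent action granularity is too coarse for fine-grained motor skills
Doubles as a world model: the quantization decoder can serve as a neural simulator, generating future frames conditioned on $x_{t}$ and the latent action predicted by LAPA, which yields closed-loop rollouts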

Teaser. Problem setup: learning a robotic foundation model from human videos without action labels.

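Method

Overall Pipeline

Figure 2. LAPA: two pretraining stages plus one finetuning stage.

Three models are trained in sequence:

Latent Action Quantization (VQ-VAE): the encoder takes $x_{t}, x_{t + H}$ and outputs discrete $z_{t}$; the decoder reconstructs $x_{t + H}$
Latent Pretraining: the VLM predicts $z_{t}$ conditioned on $x_{t}$ and the language instruction; in effect this is behavior cloning on pseudo-labels
Action Finetuning: using a small number of action-labeled real-robot trajectories, the latent action head is discarded and a new head over discretized delta end-effector actions is created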

1. Latent Action Quantization

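Architecture: encoder-decoder. The encoder is a variant of the C-ViViT tokenizer (spatial + temporal transformer); the decoder uses only a spatial transformer (because the input is just two frames).

Key design choices:

Cross-attention instead of additive embedding: $z_{t}$ attends to $x_{t}$ through cross-attention, which captures semantically meaningful latent actions better than Genie's additive embedding
NSVQ replaces standard VQ quantization: the quantization error is replaced by the product of the original error's magnitude and a normalized noise vector, which mitigates gradient collapse
Stop gradient is applied to the patch embedding of $x_{t}$ during decoding, to avoid representation collapse
Codebook replacement (from NSVQ): unused codebook entries are replaced during early training to maximize codebook utilization
Latent action representation: a sequence of $s$ tokens, each drawn from a codebook of size $|C|$. The main experiments use $8^{4}$ (sequence length 4, vocab 8), except for Language Table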
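A minimal sketch of the NSVQ substitution described above, in pure Python. It assumes the usual NSVQ form, in which the quantized vector is approximated during training by $z + \lVert z - q \rVert \cdot v / \lVert v \rVert$ with $v \sim \mathcal{N}(0, I)$ and $q$ the nearest codebook entry. The embedding dimension (16), the random codebook, and the helper names `nearest_code` and `nsvq` are illustrative assumptions, not taken from the LAPA code.

```python
import math
import random

def nearest_code(z, codebook):
    """Return (index, vector) of the codebook entry closest to z in L2 distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]

def nsvq(z, codebook, rng):
    """Noise-substitution surrogate: z + ||z - q|| * v / ||v||, with v ~ N(0, I)."""
    idx, q = nearest_code(z, codebook)
    err_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, q)))
    v = [rng.gauss(0.0, 1.0) for _ in z]
    v_norm = math.sqrt(sum(x * x for x in v))
    z_tilde = [a + err_norm * x / v_norm for a, x in zip(z, v)]
    return idx, z_tilde

# Illustrative sizes: s = 4 tokens, |C| = 8 codes, 16-dim embeddings (assumed).
rng = random.Random(0)
DIM, VOCAB, SEQ_LEN = 16, 8, 4
codebook = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(SEQ_LEN)]

# tokens is the discrete latent action z_t; surrogates feed the decoder in training.
tokens, surrogates = zip(*(nsvq(z, codebook, rng) for z in latents))
print("latent action tokens:", tokens)
```

The surrogate sits at the same distance from $z$ as the hard nearest-code assignment but in a random direction, so in an autodiff framework the error norm would carry gradients to both the encoder output and the codebook. At inference time the hard indices in `tokens` would be used instead.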

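❓ How is $H$ (the frame gap) chosen, and how does it affect latent granularity? The main text does not go into detail. Appendix A should cover it; worth a close read before using this method.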

2. Latent Pretraining

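Goal: use the quantization encoder as an inverse dynamics model to assign a pseudo-label $z_{t}$ to every pair $x_{t}, x_{t + 1}$; then train the pretrained VLM (LWM-Chat-1M 7B) to predict autoregressively on $(x_{t}, instruction) \to z_{t}$.

Implementation details:

The original LM head is not used; a separate latent action head of size $|C|$ (a single-layer MLP) is attached
The vision encoder is frozen and the LM is unfrozen
Independent of conventional action parameterizations (EE / joint): what is learned is a purely data-driven compressed representation of the delta between consecutive observations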
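The latent action head and its training signal can be sketched as follows. This is a toy illustration under stated assumptions: a single linear layer maps an LM hidden state to $|C| = 8$ logits, one per codebook entry, and the loss is token-level cross-entropy against the pseudo-label (the notes describe this stage as behavior cloning on pseudo-labels; cross-entropy is the assumed objective). The hidden size, the random stand-in hidden states, and the example label `[3, 0, 7, 5]` are hypothetical.

```python
import math
import random

def linear(h, W, b):
    """Single-layer head: logits = W h + b, one logit per codebook entry."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

rng = random.Random(0)
HIDDEN, VOCAB, SEQ_LEN = 32, 8, 4  # HIDDEN is illustrative; the real LM is far wider

# Freshly initialized head (assumed small Gaussian init).
W = [[rng.gauss(0, 0.02) for _ in range(HIDDEN)] for _ in range(VOCAB)]
b = [0.0] * VOCAB

# Stand-ins for the LM's hidden states at the s = 4 latent-token positions.
hidden_states = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]
pseudo_label = [3, 0, 7, 5]  # hypothetical z_t produced by the quantization encoder

loss = sum(
    cross_entropy(linear(h, W, b), t) for h, t in zip(hidden_states, pseudo_label)
) / SEQ_LEN
print(f"mean token loss at init: {loss:.3f} (uniform guess would be {math.log(VOCAB):.3f})")
```

At initialization the loss is close to $\ln 8$, the value for a uniform guess over the codebook; training on pseudo-labels would push it below that.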

3. Action Finetuning

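Delta EE actions are discretized per dimension into equal-frequency bins (following OpenVLA)
The latent action head is discarded and a new discrete action head is created
As before, the vision encoder is frozen and the LM is unfrozen
The paper also tried keeping the latent head and adding a decoder head on top (the approach of Schmidt & Jiang 2024), but this performed worse for the 7B policy model; reinitializing the head worked better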
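Equal-frequency binning of one action dimension can be sketched as below. This is an assumption-laden illustration rather than the paper's code: the bin count (8), the synthetic Gaussian deltas, and the helper names `equal_frequency_edges` and `to_bin` are all hypothetical. The point is only that bin edges are placed at empirical quantiles, so every bin receives about the same number of training samples.

```python
import bisect
import random

def equal_frequency_edges(values, n_bins):
    """Interior bin edges at empirical quantiles, so bins hold ~equal sample counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

def to_bin(value, edges):
    """Map a continuous value to a bin index in [0, len(edges)]."""
    return bisect.bisect_right(edges, value)

rng = random.Random(0)
N_BINS = 8  # illustrative; not the paper's setting

# Synthetic stand-in for one delta end-effector dimension (e.g. dx in metres).
dx = [rng.gauss(0.0, 0.01) for _ in range(8000)]
edges = equal_frequency_edges(dx, N_BINS)

counts = [0] * N_BINS
for v in dx:
    counts[to_bin(v, edges)] += 1
print("samples per bin:", counts)
```

Compared with uniform-width bins, quantile edges spend more resolution where actions are dense (small deltas near zero) and less on rare large motions. In practice each action dimension would get its own set of edges.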

Experiments

4.1 Benchmarks

Language Table (2-DOF block pushing, 5 subtasks)
SIMPLER (WidowX 7-DOF, 4 tasks): since no finetuning trajectories exist, 100 multi-task trajectories were collected from successful rollouts of a BridgeV2 VLA
Real-robot Franka Panda 7-DOF, 3 multi-instruction tasks: (1) Pick ⟨object⟩ into Sink, (2) Cover ⟨object⟩ with Towel, (3) Knock ⟨object⟩ Over; 150 trajectories and 15 objects per task

Pretraining data: BridgeV2 / Open-X / Something-Something V2.

4.2 Baselines

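Scratch: the backbone VLM finetuned on downstream data only
UniPi (Du 2023): video diffusion pretraining + IDM finetuning to extract real actions
VPT (Baker 2022): train an IDM on labeled data → pseudo-label unlabeled video with actions → pretrain the VLM; structurally identical to LAPA's pipeline, differing only in the source of the pseudo-labels (the real action space predicted by the IDM vs LAPA's latent space)
ActionVLA: the same backbone pretrained directly on ground-truth actions (upper-bound baseline)
OpenVLA (Kim 2024): SOTA 7B VLA pretrained on 970K real Open-X demos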

4.3 Language Table Results

Table 1. Success rate (%) ± StdErr. For the Cross-env columns, pretraining uses 440k real-robot trajectories and finetuning uses 1k simulated trajectories.

	In-domain Seen	In-domain Unseen	Cross-task Seen	Cross-task Unseen	Cross-env Seen	Cross-env Unseen
Scratch	15.6	15.2	27.2	22.4	15.6	15.2
UniPi	22.0	13.2	20.8	16.0	13.6	12.0
VPT	44.0	32.8	72.0	60.8	18.0	18.4
LAPA	62.0	49.6	73.2	54.8	33.6	29.6
ActionVLA ⭐	77.0	58.8	77.0	58.8	64.8	54.0

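Observations:

LAPA beats VPT and UniPi across the board in the in-domain and cross-env settings, and substantially narrows the gap to ActionVLA (the upper bound)
In cross-task unseen, VPT edges out LAPA (60.8 vs 54.8). The authors attribute this to VPT using more labeled data (7k vs 1k), which makes its IDM more accurate. In cross-env, however, VPT's IDM degrades badly (18.0 vs 33.6 for LAPA on seen), suggesting that latent actions are more robust to environment shift than IDM pseudo-actions

4.4 Real-Robot Results

Figure 3. Average success rate over 54 real-robot rollouts.

Table 2. Success rate (%) broken down by generalization type.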

	Seen Obj. Unseen Combo	Unseen Obj.	Unseen Instr.	AVG
Scratch	18.0	20.3	25.4	21.2
ActionVLA (Bridge)	38.3	31.8	27.7	32.6
OpenVLA (Bridge)	35.6	34.6	22.1	30.8
LAPA (Bridge)	43.4	31.4	35.6	36.8
OpenVLA (Open-X)	46.2	42.1	43.4	43.9
LAPA (Open-X)	57.8	43.9	48.5	50.1
LAPA (Human Videos)	36.5	37.4	28.1	34.0

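Key findings:

LAPA (Open-X) > OpenVLA (Open-X) by +6.22%, leading on all 3 generalization types; this runs against the intuition from simulation that ActionVLA is the upper bound
The authors' explanation: pretraining with ground-truth actions overfits to the action space of the source embodiment (WidowX), which becomes a liability when finetuning across embodiments; LAPA's shared latent space does not have this problem
LAPA (Human Videos) > OpenVLA (Bridge), 34.0 vs 30.8: even with a larger cross-embodiment gap, it matches or exceeds pretraining on real-robot data of comparable scale
Failure mode: in pick-and-place, LAPA's reaching success rate (83.3%) is higher than OpenVLA's (66.7%), but misaligned grasps leave its overall success rate lower; latent actions are not fine-grained enough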
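As a quick consistency check on Table 2, the snippet below recomputes each AVG entry as the unweighted mean of the three generalization columns. That the AVG column is an unweighted mean is an assumption inferred from the numbers, not something stated in these notes; the values are copied from the table as shown.

```python
# (Seen Obj. Unseen Combo, Unseen Obj., Unseen Instr.), reported AVG
TABLE2 = {
    "Scratch":             ((18.0, 20.3, 25.4), 21.2),
    "ActionVLA (Bridge)":  ((38.3, 31.8, 27.7), 32.6),
    "OpenVLA (Bridge)":    ((35.6, 34.6, 22.1), 30.8),
    "LAPA (Bridge)":       ((43.4, 31.4, 35.6), 36.8),
    "OpenVLA (Open-X)":    ((46.2, 42.1, 43.4), 43.9),
    "LAPA (Open-X)":       ((57.8, 43.9, 48.5), 50.1),
    "LAPA (Human Videos)": ((36.5, 37.4, 28.1), 34.0),
}

def column_mean(cols):
    """Unweighted mean over the three generalization types."""
    return sum(cols) / len(cols)

recomputed = {name: column_mean(cols) for name, (cols, _) in TABLE2.items()}
gap = recomputed["LAPA (Open-X)"] - recomputed["OpenVLA (Open-X)"]

for name, (_, reported) in TABLE2.items():
    print(f"{name:20s} recomputed={recomputed[name]:.2f}  reported={reported:.1f}")
print(f"LAPA (Open-X) minus OpenVLA (Open-X): {gap:.2f} points")
```

Every reported AVG matches the recomputed mean to within rounding. From the rounded table entries the Open-X gap comes out at about 6.2 points, so the quoted +6.22% is presumably computed from unrounded per-task numbers.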

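4.6 Pretraining Efficiency

LAPA (Open-X) pretraining: 8× H100, 34h, batch 128 → 272 H100-hours
OpenVLA pretraining: 21,500 A100-hours, batch 2048
Rough estimate: LAPA is about 30-40× more efficient (allowing for the H100's 2-3× speed advantage over the A100)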
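The arithmetic behind the efficiency estimate, written out. The GPU-hour figures are the ones listed above; the 2× and 3× H100-over-A100 factors are the conversion assumption quoted in the notes, not measured values.

```python
LAPA_GPUS, LAPA_HOURS = 8, 34       # 8x H100 for 34 hours
OPENVLA_A100_HOURS = 21_500

lapa_h100_hours = LAPA_GPUS * LAPA_HOURS
raw_ratio = OPENVLA_A100_HOURS / lapa_h100_hours

def adjusted_ratio(h100_speedup):
    """Compute ratio after converting LAPA's H100-hours to A100-equivalent hours."""
    return OPENVLA_A100_HOURS / (lapa_h100_hours * h100_speedup)

print(f"LAPA: {lapa_h100_hours} H100-hours")
print(f"raw GPU-hour ratio: {raw_ratio:.1f}x")
for speedup in (2.0, 3.0):
    print(f"assuming H100 = {speedup:.0f}x A100: {adjusted_ratio(speedup):.1f}x")
```

Under these assumptions the raw GPU-hour ratio is about 79×, and the speed-adjusted ratio is roughly 40× at a 2× H100 advantage and roughly 26× at 3×. The quoted 30-40× therefore corresponds to the lower end of the assumed speedup range.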

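Where the efficiency comes from:

The LWM backbone's pretraining objective already includes next-frame prediction, so it has implicitly learned high-level action structure
LAPA's action space is far smaller ($8^{4} = 4096$ vs $256^{7}$ for OpenVLA), so it is easier to learn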
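Spelling out the two action-space sizes, assuming that the $256^{7}$ figure comes from 256 bins per dimension over a 7-dimensional action:

$$
8^{4} = 2^{12} = 4096, \qquad 256^{7} = 2^{56} \approx 7.2 \times 10^{16}, \qquad \frac{256^{7}}{8^{4}} = 2^{44} \approx 1.8 \times 10^{13}
$$

These numbers count joint action combinations. Each model still predicts one token at a time, so the per-token choice is 8-way for LAPA and, under the same assumption, 256-way for OpenVLA.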

All LAPA variants reach their best performance within a single epoch; ActionVLA takes 3 epochs, and OpenVLA needs 30.

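5.1 Scaling Analysis

Scaling along three dimensions (quantization model size, amount of pretraining data, and latent action representation space, i.e. sequence length × vocab size) yields gains on all three.

Key insight: the optimal size of the latent action space depends on the action complexity of the data itself.

Language Table (2-DOF): increasing the vocab helps more than increasing the sequence length
Other tasks: $8^{4}$ is enough

This implies that scaling to web-scale video (which includes whole-body control) will require a larger latent action space.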

5.2 Latent Action Analysis

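Figure 6. Conditioning on the same latent action across different embodiments produces similar motions, which indicates that the latent space really is shared.

Qualitative findings:

On Language Table, each latent action corresponds to a distinct 2D motion direction, and the latent clusters align with the real action space
In human manipulation videos, latent actions absorb camera viewpoint changes (the viewpoint moves within the videos); this is both a feature (it captures environment-centric motion) and a bug (it dilutes action semantics)
On multi-embodiment Open-X, the same latent gives a consistent direction in the reconstructions of different robots, supporting the hypothesized source of positive cross-embodiment transfer

Figure 7. Closed-loop rollout with LAPA as a neural world model: given the instruction "take broccoli out of pot", it generates a complete rollout.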

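Related Work

Builds on

Genie: the direct source of latent action quantization; LAPA switches additive → cross-attention, introduces NSVQ, and uses latent actions for VLA training rather than game generation
LWM-Chat-1M (Liu 2024): 7B backbone VLM; its next-frame prediction pretraining objective provides a prior that helps explain LAPA's efficiency
VQ-VAE (van den Oord 2017): the foundational method for discrete latents

Compared against

OpenVLA: the strongest baseline, pretrained on 970K Open-X ground-truth-action demos; LAPA is +6.22% on the real robot and 30-40× more efficient, making this the central comparison
VPT (Baker 2022): IDM pseudo-action pretraining, structurally symmetric to LAPA's pipeline; LAPA's advantage is evidence that a latent action space is a better pretraining target than the IDM's real action space
UniPi (Du 2023): video diffusion pretraining + IDM finetuning; LAPA outperforms it across the board, suggesting that predicting action tokens directly is more effective than generating video first and then extracting actions

Methodologically related

Genie: latent action for game generation
Vista / Robotic World Model: world-model line of work; LAPA's quantization decoder can be viewed as a lightweight neural simulator
Schmidt & Jiang 2024 (Learning to Act without Actions): a similar idea in game environments
Follow-up work: DreamGen (video-based data generation), UWM (unified world-action model), GR00T N1 and others all continue the "latent action / video as data" line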

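Paper Review

Strengths

Clear paradigm: upgrades "action-free video → VLA" from hand-designed features (affordance, optical flow, hand pose) to end-to-end latent actions, which is clean and scalable. The BPE analogy is apt
The cross-embodiment "negative result" is convincing: ActionVLA is the upper bound in simulation but falls behind LAPA on the real robot across embodiments. This counterintuitive result is the paper's most informative claim, and the authors give a plausible mechanistic explanation (action-space overfitting)
Hard efficiency numbers: a 30-40× pretraining speedup with comparable or better downstream performance is a tangible engineering win that lowers the barrier to entry for VLA research
Human video really is usable: SSv2-pretrained LAPA matches OpenVLA trained on real-robot data, the first strong evidence for the "web-scale video → robotics foundation" narrative
The quantization model does double duty: the encoder acts as the IDM and the decoder as the world model; one set of parameters supports both pseudo-labeling and closed-loop rollouts, which keeps the method compact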

Weaknesses

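Latent action interpretability cuts both ways: the upside is that it can be analyzed (Figure 6); the downside is that it mixes camera motion, scene motion, and agent motion together, which is especially visible on human video. This plausibly explains why LAPA is weaker on grasping, a task that needs fine-grained agent-centric control
Hard-coded granularity: $H$ (a fixed frame gap), $s = 4$, and $|C| = 8$ are hand-picked; the best values differ across datasets and tasks. There is no adaptive scheme for latent granularity
Finetuning data is still not small: the Franka tasks use 150 trajectories × 3 = 450 action-labeled trajectories. That is far fewer than OpenVLA's 970K, but nowhere near "zero/few-shot from video only". The paper's framing slightly overclaims
Non-manipulation is unexplored: the authors acknowledge they did not try navigation, driving, or landscape; yet the latent space's tendency to "swallow" camera motion may matter even more in those domains
Limited scaling experiments: all three scaling dimensions were studied on the relatively small Bridge→SIMPLER setup; no scaling curves are given for Open-X → real robot or for true web video
Baseline comparison is not fully fair: UniPi (video generation → IDM) uses a backbone of a similar class to LAPA's, but that backbone is not LWM, so UniPi's weakness may partly stem from the backbone difference rather than from the method itself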

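Credibility Assessment

Artifact Availability

Code: inference + training (the LAPA GitHub repo publishes finetuning / pretraining scripts and training code for the latent quantization model)
Model weights: LAPA-7B-openx released (HuggingFace latent-action-pretraining/LAPA-7B-openx), pretrained on Open-X
Training details: hyperparameters, H100-hours, batch size, and epoch counts are all disclosed; Appendices A and B give the full model architecture and training hyperparameters
Datasets: all open source (Open-X Embodiment, BridgeV2, Language Table, SIMPLER, Something-Something V2); the real-robot data is not released but the protocol is clear

Claim Verifiability

✅ "Open-X LAPA > OpenVLA +6.22%": detailed per-task and per-generalization-type numbers are provided (Table 2 / Figure 3) over 54 rollouts, with StdErr available; the model is open source and reproducible
✅ "30-40× pretraining efficiency": backed by hard GPU-hour numbers (272 H100-h vs 21,500 A100-h), though the conversion assumption (H100 = 2-3× A100) needs a caveat
⚠️ "Latent action is a shared representation": only qualitative examples (Figure 6) and reconstructed images, with no quantitative cross-embodiment consistency metric; this supports the conclusion but is not strong evidence
⚠️ "Human video pretraining transfers positively": the real-robot evaluation covers only 3 tasks and 54 rollouts, so statistical power is limited; and SSv2 is curated "human hand + object" video, not random YouTube video
⚠️ "First unsupervised method for VLA": VPT (Baker 2022) and Schmidt & Jiang (2024) pursued similar ideas; strictly speaking the "first" is "first monolithic VLA on real robots with this pipeline"
❌ "Open up the potential for leveraging web-scale data": a forward-looking vision claim, not a conclusion validated in this paper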

Notes

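The right level of abstraction for latent actions is an open question: LAPA's $8^{4}$ suits tabletop manipulation but may be wrong for whole-body, dexterous, or navigation settings. Follow-up work such as GR00T and DreamGen each tune this choice
Generality of the cross-embodiment insight is uncertain: ActionVLA overfits because the WidowX→Franka action spaces genuinely differ; in a Franka→Franka cross-task setting, would ground-truth action pretraining still lose to LAPA? The paper does not say
Relation to the flow matching / diffusion action head paradigm: π0, π0.5, SmolVLA and others take the continuous action head route, while LAPA uses discrete latents. Combining the two (e.g. latents for planning, a continuous head for control) is a natural next step; Hi Robot and DreamGen are already moving in this direction
Latent action as a world-model interface: the closed-loop rollout in Figure 7 hints that LAPA could do test-time planning (generate N latent trajectories → score → pick the best). This is a promising path for VLA + test-time scaling from 2025 onward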
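The "generate N latent trajectories → score → pick the best" idea in the last note can be sketched as a best-of-N search. Everything here is hypothetical: the paper does not describe such a planner, candidate plans are sampled uniformly at random instead of from the policy, and `toy_step` and `toy_score` are scalar stand-ins for the quantization decoder (world model) and a task scorer.

```python
import random

VOCAB, SEQ_LEN = 8, 4  # one latent action = 4 tokens from a size-8 codebook

def sample_latent_action(rng):
    """Stand-in proposal: a uniformly random latent action (the policy would go here)."""
    return tuple(rng.randrange(VOCAB) for _ in range(SEQ_LEN))

def rollout(state, plan, step_fn):
    """Apply a sequence of latent actions through the world model step function."""
    for z in plan:
        state = step_fn(state, z)
    return state

def plan_best_of_n(state, step_fn, score_fn, n_candidates, horizon, rng):
    """Sample n_candidates plans, roll each out, and return the highest-scoring one."""
    best_plan, best_score = None, float("-inf")
    for _ in range(n_candidates):
        plan = [sample_latent_action(rng) for _ in range(horizon)]
        score = score_fn(rollout(state, plan, step_fn))
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score

def toy_step(state, z):
    """Hypothetical dynamics on a scalar state; a real decoder would predict frames."""
    return state + sum(z) - 14

def toy_score(state):
    """Hypothetical scorer: closer to the goal value 10 is better."""
    return -abs(state - 10)

best_plan, best_score = plan_best_of_n(
    state=0, step_fn=toy_step, score_fn=toy_score,
    n_candidates=64, horizon=5, rng=random.Random(0),
)
print("best score:", best_score)
```

In a real system the cost would be dominated by the N × horizon decoder calls, and the quality of the result would depend on how well the scorer correlates with task success; both are open design questions rather than anything the paper reports.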

Rating

Metrics (as of 2026-04-22): citation=209, influential=37 (17.7%), velocity=11.5/mo; HF upvotes=2; github 510⭐ / forks=36 / 90d commits=0 / pushed 454d ago · stale

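Score: 2 - Frontier

Rationale: the influential-citation ratio of 17.7% (37/209) is well above the typical ~10%, and citation velocity has held steady at 11.5/mo for 18 months. This indicates that the latent-action-from-video line has been substantively taken up by follow-up work (DreamGen, UWM, GR00T, EchoVLA, etc.) and that this is one of the representative works in the sub-area; 510 stars is not a breakout, but 36 forks is not low for the VLA niche. Why not 3: (1) the method is essentially Genie transferred to the VLA domain plus engineering, so its originality is "nice synthesis" rather than foundational; (2) the repo is stale (pushed 454d ago, 90d commits=0), the authors have moved on to new projects, and the code is no longer the primary reference implementation; (3) latent action as a paradigm is still evolving quickly (hierarchical / continuous latent / hybrid with flow matching), so LAPA's specific recipe is unlikely to become the de facto standard. Why above 1: it is a must-cite baseline for "video-only VLA pretraining", and its key cross-embodiment negative result is still being cited and extended.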
