Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Summary

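Core idea: a single diffusion transformer diffuses both actions and future observations, giving each modality its own diffusion timestep. The timestep→mask correspondence unifies policy / forward dynamics / inverse dynamics / video prediction in one model, and it naturally supports cotraining on action-free video (just fix the action timestep at $T$).

Method: a coupled noise predictor $s_{θ} (o, a_{t_{a}}, o_{t_{o^{'}}}^{'}, t_{a}, t_{o^{'}})$ trained with the two timesteps sampled independently; at inference, fixing $t_{a} \in \{0, T\}$ or $t_{o^{'}} \in \{0, T\}$ switches among the four conditional / marginal distributions.

Results: pretraining on 2K DROID trajectories plus finetuning on 5 real Franka tasks clearly outperforms DP / PAD / GR1 on average across ID and OOD; cotraining with 2K action-free DROID videos improves further (Stack-Bowls OOD 0.76→0.84), and the LIBERO-10 OOD average of 0.79 is also the best.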

Sources: paper | website | github

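Rating: 2 - Frontier ("timestep ≡ soft mask" is an elegant reframing and the paper was accepted at RSS 2025, but community adoption signals are weak (212⭐/12.6mo, HF=4, stale repo), so it is provisionally placed at Frontier pending follow-up work that builds on it)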

Key Takeaways:

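Timestep = soft mask: a diffusion timestep's noise level is structurally equivalent to a mask. $t = T$ amounts to feeding the token in as pure Gaussian noise (marginalize), while $t = 0$ amounts to providing the clean observation (condition). Controlling each modality's timestep independently yields a continuous version of mask-based multitask training, a cleaner design than PAD's shared timestep.
Why the unified model generalizes better than BC: on the same $(o, a, o^{'})$ data, UWM must learn noise prediction for every conditional / marginal, which effectively adds dense pixel-reconstruction supervision for free, plus implicit causal structure between actions and future pixels. The "reconstruct current obs" ablation in Table VIII (0.70 vs 0.86) shows the gain is not just a pixel auxiliary loss but comes from predicting future dynamics.
The right way to use action-free video: using the diffusion timestep as a continuous mask is far more robust than the learnable mask tokens of GR1 / PAD. GR1 gets worse with cotraining on 3/5 tasks (video dilutes the action signal), whereas UWM improves on 5/5. This offers a clean architectural recipe for "video as a pretraining signal".
Registers are not decoration: action and image latents are heterogeneous tokens, and without registers the transformer has no extra "scratchpad" for exchanging information between modalities. 8 register tokens lift Book-Caddy from 0.81 to 0.88 (Table VII), confirming that the ViT-register observation also holds for a multimodal DiT.
Limits: no cross-embodiment transfer yet (internet-video cotraining gives only a slight gain, 0.88 vs 0.92 with robot video); forward-dynamics outputs show visual artifacts, so the potential of planning with this world model is not demonstrated.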

Teaser. UWM architecture overview: a unified DiT backbone plus two independent diffusion timesteps that control the action and video modalities.

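Background: Diffusion Models and Conditional Generation

UWM builds on the standard DDPM framework. Data $x_{0} \sim p (x_{0})$ is corrupted by $T$ steps of Gaussian noise to give $x_{t}$, and the network $s_{θ}$ learns noise prediction via denoising score matching:

$$\min_{θ} \; \mathbb{E}_{x_{0}, t, ϵ} \left[ \lVert s_{θ} (x_{t}, t) - ϵ \rVert_{2}^{2} \right], \quad x_{t} = \sqrt{\bar{α}_{t}} \, x_{0} + \sqrt{1 - \bar{α}_{t}} \, ϵ$$

For conditional generation, the conditioning variable $z_{0}$ is simply concatenated into the network input: $s_{θ} (x_{t}, z_{0}, t)$. UWM's key insight is that when several variables are present, giving each one an independent timestep lets a single network express all the conditional / marginal distributions.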
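
To make the forward noising step concrete, here is a minimal pure-Python sketch. The cosine schedule for $\bar{α}_{t}$ and the helper names `cosine_alpha_bar` and `add_noise` are assumptions for illustration; the notes do not say which noise schedule UWM uses.

```python
import math
import random


def cosine_alpha_bar(T, s=0.008):
    """Assumed cosine schedule: alpha_bar[0] = 1 and alpha_bar[T] is ~0."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]


def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(max(0.0, 1.0 - alpha_bar[t]))
    x_t = [scale * x + sigma * e for x, e in zip(x0, eps)]
    return x_t, eps


alpha_bar = cosine_alpha_bar(T=100)
rng = random.Random(0)
clean = [0.5, -1.0, 2.0]
x_at_0, _ = add_noise(clean, 0, alpha_bar, rng)        # identical to `clean`
x_at_T, eps_T = add_noise(clean, 100, alpha_bar, rng)  # essentially pure noise
```

At $t = 0$ the sample is returned unchanged and at $t = T$ it is indistinguishable from the drawn noise, which are the two endpoints the method relies on.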

Method: Coupled Video-Action Diffusion

Problem Setup

Given a robot dataset $D_{e} = \{(o_{i}, a_{i}, o_{i}^{'})\}$ and optional action-free video data $D_{af} = \{(o_{i}, o_{i + 1})\}$, the goal is to train the following four models simultaneously:

Policy $p (a ∣ o)$
Forward dynamics $p (o^{'} ∣ o, a)$
Inverse dynamics $p (a ∣ o, o^{'})$
Video prediction $p (o^{'} ∣ o)$

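UWM unifies them in a single diffusion network.

Decoupled Timesteps

Starting point: a joint noise predictor $s_{θ} (o, a_{t}, o_{t}^{'}, t)$ can only sample the joint distribution $p (a, o^{'} ∣ o)$. For flexible inference, UWM decouples the timesteps of the action and the next observation:

$$s_{θ} (o, a_{t_{a}}, o_{t_{o^{'}}}^{'}, t_{a}, t_{o^{'}}) \approx \mathbb{E} \left[ ϵ_{a}, ϵ_{o^{'}} \mid o, a_{t_{a}}, o_{t_{o^{'}}}^{'} \right]$$

Key observation: the diffusion timestep acts as a continuous mask. When $t_{o^{'}} = T$, $o_{T}^{'}$ is approximately pure Gaussian noise and carries no information for action prediction, which is equivalent to marginalizing out $o^{'}$; when $t_{o^{'}} = 0$, $o_{0}^{'} = o^{'}$ is the clean ground truth, which is equivalent to conditioning on $o^{'}$.

Training Loss (Eq. 1)

$$ℓ (θ) = \mathbb{E}_{(o, a, o^{'}) \sim D, \; t_{a}, t_{o^{'}} \sim U (0, T), \; ϵ_{a}, ϵ_{o^{'}} \sim N (0, I)} \left[ w_{a} \lVert ϵ_{a}^{θ} - ϵ_{a} \rVert_{2}^{2} + w_{o^{'}} \lVert ϵ_{o^{'}}^{θ} - ϵ_{o^{'}} \rVert_{2}^{2} \right]$$

Here $(ϵ_{a}^{θ}, ϵ_{o^{'}}^{θ})$ are the two outputs of $s_{θ}$. The two timesteps are sampled independently and uniformly, which is the core architectural difference from PAD. In the implementation, $w_{a} = w_{o^{'}} = 1.0$.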
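
One training step of Eq. 1 can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' code: timesteps are treated as discrete integers in $[0, T]$, the noise schedule is an assumed cosine schedule, and `zero_predictor` is a hypothetical stand-in for the DiT $s_{θ}$.

```python
import math
import random

T = 100  # number of training diffusion steps reported in these notes


def alpha_bar(t, s=0.008):
    """Assumed cosine schedule: alpha_bar(0) = 1 and alpha_bar(T) is ~0."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def noised(x, t, rng):
    """Return (x_t, eps) with x_t = sqrt(ab) * x + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    sigma = math.sqrt(max(0.0, 1.0 - ab))
    return [math.sqrt(ab) * v + sigma * e for v, e in zip(x, eps)], eps


def squared_error(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target))


def uwm_loss(o, a, o_next, predictor, rng, w_a=1.0, w_o=1.0):
    # The two timesteps are drawn independently of each other.
    t_a = rng.randint(0, T)
    t_o = rng.randint(0, T)
    a_t, eps_a = noised(a, t_a, rng)
    o_t, eps_o = noised(o_next, t_o, rng)
    pred_a, pred_o = predictor(o, a_t, o_t, t_a, t_o)
    loss = w_a * squared_error(pred_a, eps_a) + w_o * squared_error(pred_o, eps_o)
    return loss, (t_a, t_o)


def zero_predictor(o, a_t, o_t, t_a, t_o):
    """Hypothetical placeholder for s_theta: predicts zero noise everywhere."""
    return [0.0] * len(a_t), [0.0] * len(o_t)


loss, (t_a, t_o) = uwm_loss([0.0], [0.1, 0.2], [0.3, 0.4, 0.5],
                            zero_predictor, random.Random(0))
```

A shared-timestep variant would draw a single `t` and reuse it for both modalities; drawing two is the only change needed to cover every $(t_{a}, t_{o^{'}})$ combination during training.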

Flexible Inference

The four combinations of fixed timesteps correspond to four kinds of model:

Table A. Timestep → inference mode

Mode	$t_{a}$	$t_{o^{'}}$	Sample from
Policy	reverse denoise $T \to 0$	$T$	$p (a ∣ o)$
Video prediction	$T$	reverse denoise $T \to 0$	$p (o^{'} ∣ o)$
Forward dynamics	$0$ (clean $a$)	reverse denoise $T \to 0$	$p (o^{'} ∣ o, a)$
Inverse dynamics	reverse denoise $T \to 0$	$0$ (clean $o^{'}$)	$p (a ∣ o, o^{'})$
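
The table can be read as a lookup from inference mode to a per-modality timestep plan. The sketch below encodes it that way; `MODES`, `Plan`, and `timesteps` are hypothetical names, and the evenly spaced 10-step schedule over $T = 100$ is an assumption about how the DDIM steps are spaced.

```python
from dataclasses import dataclass

T = 100  # training diffusion steps


@dataclass(frozen=True)
class Plan:
    action: str    # "denoise", "noise" (held at T) or "clean" (held at 0)
    next_obs: str


MODES = {
    "policy": Plan(action="denoise", next_obs="noise"),
    "video_prediction": Plan(action="noise", next_obs="denoise"),
    "forward_dynamics": Plan(action="clean", next_obs="denoise"),
    "inverse_dynamics": Plan(action="denoise", next_obs="clean"),
}


def timesteps(mode, steps=10):
    """Return the (t_a, t_o) pair fed to the network at each reverse step."""
    plan = MODES[mode]
    schedule = [T * (steps - k) // steps for k in range(steps)]  # T, ..., T/steps

    def pick(role, t):
        return {"denoise": t, "noise": T, "clean": 0}[role]

    return [(pick(plan.action, t), pick(plan.next_obs, t)) for t in schedule]


policy_steps = timesteps("policy")             # t_o stays at T throughout
forward_steps = timesteps("forward_dynamics")  # t_a stays at 0 throughout
```

Only the timestep inputs differ between modes; the network and its weights are the same in all four cases.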

Figure. Training and inference pipeline.

Architecture

DiT backbone + AdaLN conditioning。

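Key components:

Observation encoder: each frame from each camera goes through a ResNet-18 (ImageNet-pretrained); the features are flattened, concatenated, and used together with the sinusoidal embeddings of the two timesteps as the AdaLN condition
Image tokens: a frozen SDXL VAE compresses $(224, 224, 3)$ images to $(28, 28, 4)$ latents, which a $(4, 4, 2)$ spatiotemporal patchifier turns into image patch embeddings
Action tokens: each step of the action chunk (length $h_{a} = 16$) goes through a shallow MLP encoder
Registers: $N_{r} = 8$ randomly initialized learnable tokens; ablations show 8 beats both 0 and 4
Sampler: DDIM, 100 diffusion steps in training, 10 steps at inference
Action chunking: predict $h_{a} = 16$ actions, execute the first $h_{a}^{'} = 8$, then replan
Compute: 4×A100, 24h for 100K-step pretraining on DROID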
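
The action-chunking item describes a receding-horizon loop: predict 16 actions, execute 8, replan. A minimal sketch follows, where `policy` and `env_step` are hypothetical callables standing in for UWM's policy mode and the robot interface.

```python
def run_episode(policy, env_step, obs, max_steps, horizon=16, execute=8):
    """Receding-horizon control: predict `horizon` actions, run `execute`, replan."""
    executed = []
    replans = 0
    while len(executed) < max_steps:
        chunk = policy(obs)  # one full denoising pass produces a whole chunk
        replans += 1
        if len(chunk) != horizon:
            raise ValueError("policy must return exactly `horizon` actions")
        for action in chunk[:execute]:
            obs = env_step(obs, action)
            executed.append(action)
            if len(executed) >= max_steps:
                break
    return executed, replans


# Toy stand-ins: the observation is a step counter and actions echo it.
def toy_policy(obs):
    return [obs + i for i in range(16)]


def toy_env_step(obs, action):
    return obs + 1


actions, replans = run_episode(toy_policy, toy_env_step, obs=0, max_steps=20)
```

In the toy run, 20 executed steps need 3 policy calls (8 + 8 + 4 actions); the unexecuted second half of each chunk is discarded.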

Cotraining on Action-Free Video

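For a video sample $(o, o^{'})$, the action timestep is fixed at $t_{a} = T$ and the missing action is filled with $ϵ_{a} \sim N (0, I)$; training uses the same Eq. 1 loss. Because the model should "ignore" $a_{t_{a}}$ when $t_{a} = T$ (it is pure noise), the action loss term carries no useful signal, and the real gradient signal comes from video prediction. This is more natural than the learnable mask tokens of GR1 / PAD: the mask is simply the extreme value of the timestep.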
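
A hedged sketch of how robot and action-free samples could be prepared for the same loss is shown below. The helper `make_training_inputs`, the cosine schedule, and the choice to reuse the filler noise as the action target are assumptions for illustration; the notes only state that $t_{a} = T$ and that the missing action is filled with Gaussian noise.

```python
import math
import random

T = 100


def alpha_bar(t, s=0.008):
    """Assumed cosine schedule: alpha_bar(0) = 1 and alpha_bar(T) is ~0."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def noised(x, t, rng):
    ab = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    sigma = math.sqrt(max(0.0, 1.0 - ab))
    return [math.sqrt(ab) * v + sigma * e for v, e in zip(x, eps)], eps


def make_training_inputs(o_next, action, rng, action_dim=3):
    """Build noisy inputs and noise targets; `action=None` marks a video sample."""
    t_o = rng.randint(0, T)
    o_t, eps_o = noised(o_next, t_o, rng)
    if action is None:
        # Action-free video: pin t_a at T and feed pure noise as the action input.
        t_a = T
        eps_a = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
        a_t = list(eps_a)
    else:
        t_a = rng.randint(0, T)
        a_t, eps_a = noised(action, t_a, rng)
    return {"t_a": t_a, "t_o": t_o, "a_t": a_t, "o_t": o_t,
            "eps_a": eps_a, "eps_o": eps_o, "action_free": action is None}


rng = random.Random(1)
video_sample = make_training_inputs([0.3, 0.4], None, rng)
robot_sample = make_training_inputs([0.3, 0.4], [0.1, -0.1, 0.0], rng)
```

Both kinds of sample have the same shape, so they can be mixed in one batch without a separate code path or a mask token.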

Experiments

Real-Robot (DROID Platform)

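2,000 trajectories are sampled from DROID for pretraining; another 2,000 with actions removed serve as cotraining video. Five Franka tasks: Stack-Bowls / Block-Cabinet / Paper-Towel / Hang-Towel (deformable) / Rice-Cooker (long-horizon). Each task is evaluated ID and OOD (with added distractors), with 50 random initializations per condition (20 for Rice-Cooker).

Table I. Real robot success rates (Pretrain / Cotrain). UWM ranks first on both ID and OOD across the board, and video cotraining improves it further.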

	Stack-Bowls ID	Stack-Bowls OOD	Block-Cabinet ID	Block-Cabinet OOD	Paper-Towel ID	Paper-Towel OOD	Hang-Towel ID	Hang-Towel OOD	Rice-Cooker ID
UWM (Ours)	0.86 / 0.92	0.76 / 0.84	0.76 / 0.84	0.60 / 0.72	0.78 / 0.86	0.78 / 0.84	0.82 / 0.86	0.64 / 0.76	0.60 / 0.65
DP	0.48 / –	0.36 / –	0.60 / –	0.26 / –	0.52 / –	0.48 / –	0.64 / –	0.28 / –	0.35 / –
PAD	0.08 / 0.20	0.08 / 0.12	0.00 / 0.00	0.00 / 0.00	0.42 / 0.42	0.34 / 0.44	0.52 / 0.54	0.30 / 0.38	0.00 / 0.00
GR1	0.66 / 0.62	0.48 / 0.38	0.66 / 0.74	0.44 / 0.64	0.60 / 0.46	0.60 / 0.46	0.66 / 0.66	0.48 / 0.44	0.40 / 0.25

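Key comparisons:

vs DP: DP cannot exploit the pixel information in multitask data; it is already far behind on ID (average $\sim$ 0.5 vs 0.76 for UWM), and the gap widens on OOD
vs PAD (shared timestep): PAD fails outright (0% on Block-Cabinet / Rice-Cooker). This is attributed to PAD concatenating raw pixels with the noisy next observation for channel-wise conditioning, which pushes the image-encoding burden onto the same transformer and leaves it short of capacity; in addition, a shared timestep does not support marginal policy inference
vs GR1 (regression): GR1 is the runner-up baseline, but video cotraining hurts it on 3/5 tasks because video dilutes the action signal. UWM improves with video cotraining on 5/5 tasks, with the largest gains (+0.12) on Block-Cabinet OOD (0.60→0.72) and Hang-Towel OOD (0.64→0.76)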
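
The averages and cotraining deltas quoted in these comparisons can be recomputed from Table I. The snippet below copies the relevant rows and treats the five "ID" columns (including Rice-Cooker ID) as the ID set, which is an assumption about how the averages were formed.

```python
from statistics import mean

COLUMNS = ["Stack-Bowls ID", "Stack-Bowls OOD", "Block-Cabinet ID",
           "Block-Cabinet OOD", "Paper-Towel ID", "Paper-Towel OOD",
           "Hang-Towel ID", "Hang-Towel OOD", "Rice-Cooker ID"]

# Success rates copied from Table I (pretrain-only and with video cotraining).
PRETRAIN = {
    "UWM": [0.86, 0.76, 0.76, 0.60, 0.78, 0.78, 0.82, 0.64, 0.60],
    "DP":  [0.48, 0.36, 0.60, 0.26, 0.52, 0.48, 0.64, 0.28, 0.35],
    "GR1": [0.66, 0.48, 0.66, 0.44, 0.60, 0.60, 0.66, 0.48, 0.40],
}
COTRAIN = {
    "UWM": [0.92, 0.84, 0.84, 0.72, 0.86, 0.84, 0.86, 0.76, 0.65],
    "GR1": [0.62, 0.38, 0.74, 0.64, 0.46, 0.46, 0.66, 0.44, 0.25],
}


def split_average(method, suffix):
    """Mean pretrain-only success over columns ending in `suffix` (ID or OOD)."""
    return mean(v for c, v in zip(COLUMNS, PRETRAIN[method]) if c.endswith(suffix))


def cotrain_delta(method):
    """Per-column change in success rate when action-free video is added."""
    return {c: round(co - pre, 2)
            for c, pre, co in zip(COLUMNS, PRETRAIN[method], COTRAIN[method])}


uwm_id, dp_id = split_average("UWM", "ID"), split_average("DP", "ID")
uwm_ood, dp_ood = split_average("UWM", "OOD"), split_average("DP", "OOD")
uwm_delta, gr1_delta = cotrain_delta("UWM"), cotrain_delta("GR1")
```

With these numbers the ID averages come out at about 0.76 (UWM) and 0.52 (DP), the UWM vs DP gap is wider on OOD than on ID, every UWM delta is positive, and GR1 has several negative deltas.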

Categorized OOD (Table IV)

A more systematic OOD study: lighting (L1/L2: static / disco lighting), background (B1/B2), and clutter (C1/C2), 5 trials each.

Table IV. Categorized OOD: the cotrained version is the most robust, with the most notable gains under lighting and background distractors.

	Stack-Bowls UWM(Co)	Stack-Bowls UWM(Pre)	Stack-Bowls DP	Block-Cabinet UWM(Co)	Block-Cabinet UWM(Pre)	Block-Cabinet DP
L1	4/5	4/5	2/5	5/5	5/5	3/5
L2	3/5	2/5	2/5	4/5	0/5	0/5
B1	4/5	3/5	3/5	4/5	3/5	2/5
B2	3/5	1/5	2/5	1/5	0/5	0/5
C1	3/5	2/5	2/5	0/5	0/5	0/5
C2	4/5	3/5	1/5	1/5	0/5	0/5
All	21/30	15/30	12/30	15/30	8/30	6/30

LIBERO-100 Simulation

Pretrain on LIBERO-90, then finetune on 5 tasks from LIBERO-10. OOD setting: a wider initialization range plus removal of background objects.

Table II. LIBERO-10 OOD success rates.

	Book-Caddy	Soup-Cheese	Bowl-Drawer	Moka-Moka	Mug-Mug	Average
UWM	0.91 ± 0.07	0.93 ± 0.01	0.80 ± 0.02	0.68 ± 0.02	0.65 ± 0.01	0.79 ± 0.11
DP	0.73 ± 0.10	0.88 ± 0.02	0.77 ± 0.02	0.65 ± 0.03	0.53 ± 0.05	0.71 ± 0.12
PAD	0.78 ± 0.04	0.47 ± 0.04	0.74 ± 0.05	0.59 ± 0.08	0.25 ± 0.04	0.57 ± 0.19
GR1	0.77 ± 0.03	0.65 ± 0.05	0.62 ± 0.03	0.46 ± 0.04	0.38 ± 0.05	0.58 ± 0.14

The OOD gap is smaller than on the real robot, which the authors attribute to overly simple simulated dynamics.
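
As a sanity check on Table II, the Average column can be reproduced from the five per-task means. The numbers are consistent with a mean ± population standard deviation across tasks; that reading is an inference from the values, not something the notes state.

```python
from statistics import mean, pstdev

# Per-task mean success rates from Table II (LIBERO-10 OOD).
LIBERO_OOD = {
    "UWM": [0.91, 0.93, 0.80, 0.68, 0.65],
    "DP":  [0.73, 0.88, 0.77, 0.65, 0.53],
    "PAD": [0.78, 0.47, 0.74, 0.59, 0.25],
    "GR1": [0.77, 0.65, 0.62, 0.46, 0.38],
}


def summarize(scores):
    """Return (mean, population std) across tasks, rounded to two decimals."""
    return round(mean(scores), 2), round(pstdev(scores), 2)


summary = {method: summarize(scores) for method, scores in LIBERO_OOD.items()}
```

Under this reading, the ± in the Average column measures spread across tasks rather than across seeds, so it should not be compared directly with the per-task ± values.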

Ablations

Table VII. Design choices (trained from scratch on single task).

	Book-Caddy	Soup-Cheese
UWM w/ 8 registers	0.88 ± 0.04	0.90 ± 0.02
UWM w/ 4 registers	0.83 ± 0.05	0.86 ± 0.03
UWM w/o registers	0.81 ± 0.07	0.85 ± 0.03
Cross-attn conditioning	0.78 ± 0.05	0.86 ± 0.04

Table VIII. Future vs current obs reconstruction (shows the gain comes from dynamics rather than a pure auxiliary loss).

	Stack-Bowls	Block-Cabinet
UWM Reconstruct Future Obs	0.86	0.76
UWM Reconstruct Current Obs	0.70	0.66
DP (No Reconstruction)	0.48	0.60

Table IX. Internet-video cotraining still gives a slight gain but falls short of in-domain robot video, indicating that the embodiment gap remains a bottleneck.

	Stack-Bowls	Block-Cabinet
UWM Robot + Robot Videos	0.92	0.84
UWM Robot + Internet Videos	0.88	0.80
UWM Robot only	0.86	0.76

Forward / Inverse Dynamics

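Forward dynamics can predict future images given clean actions (visually close to ground truth, but with VAE decoding artifacts). In trajectory tracking (recovering actions from ground-truth future observations), inverse dynamics reaches success rates of 0.65 / 0.55, clearly above the policy's 0.47 / 0.26 at the same number of steps (Table III): its actions follow the reference trajectory more closely.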

Scaling with Pretraining

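UWM trained from scratch is on par with DP; the entire gap comes from the pretraining stage, which suggests UWM really does absorb the dynamics signal in multitask data.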

Related Work

Builds on

DDPM (Ho et al. 2020): the underlying generative-model framework and the denoising score matching loss behind Eq. 1
Diffusion Policy (Chi et al. 2023): action diffusion for behavior cloning; UWM extends it to diffuse actions and future observations together
Latent Diffusion / SDXL (Rombach et al. 2021, Podell et al. 2024): the frozen SDXL VAE serves as the image encoder
DiT + AdaLN (Peebles & Xie 2022): backbone architecture
ViT Registers (Darcet et al. 2024): 8 register tokens act as a "scratchpad" for cross-modal information exchange

Compared against

PAD (Guo et al. 2024): joint video-action diffusion with a shared timestep + channel-wise raw-pixel concat conditioning. With decoupled timesteps + encoder-based conditioning, UWM shows that both changes matter
GR1 (Wu et al. 2024): regression-based video-action transformer + learnable mask token for action-free video. UWM shows that diffusion + timestep-mask is more robust than regression + learned token
Diffusion Policy (Chi et al. 2023): action diffusion only; adding future-obs diffusion gives UWM a large boost in multitask pretraining

Methodologically related

Cosmos / iVideoGPT: world models as a robot pretraining paradigm (UWM uses diffusion rather than autoregressive video tokens)
UniMask (Carroll et al. 2022): unified inference for sequential decision-making via token masking; UWM is the continuous "mask ↔ diffusion timestep" version
UniDiffuser (Bao et al. 2023): "one transformer fits all distributions" for image+text; UWM carries the same insight over to video+action
Transfusion (Zhou et al. 2024): unifies autoregression + diffusion; like UWM, it pursues multimodal unification with shared features
Diffusion Forcing (Chen et al. 2024): next-token prediction + full-sequence diffusion; the concurrent work closest to UWM's "per-token independent timestep" idea
π₀ / Octo / OpenVLA: the VLA route to large-scale robot foundation models; UWM is an alternative route that "does not rely on a VLM prior, only diffusion + world model"

Paper Review

Strengths

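Minimal conceptual unification. The single "timestep ≡ mask" trick solves two seemingly unrelated problems at once: (a) unified inference across the four models and (b) cotraining on action-free video. This is a textbook "simple, scalable, generalizable" method: no extra mask token, no two-stage training, no architectural if/else.
Ablation-style experiments against PAD. PAD's "shared timestep" and "channel-wise raw pixel concat" are separated out and implemented on the same backbone, showing that the performance gap comes from architecture rather than implementation details. This level of baseline rigor is uncommon in robot learning papers.
Ablations that answer the key why: the "reconstruct current obs" ablation in Table VIII directly rules out the trivial explanation that "UWM is better because of an extra image auxiliary loss", strengthening the claim that future dynamics prediction is what matters.
Honest reporting of negative results: GR1 dropping on 3/5 tasks with cotraining, PAD collapsing, only a slight gain from internet-video cotraining, and a smaller OOD gap in simulation are all counter-to-narrative data that the paper keeps and offers hypotheses for.
Full disclosure of code + data subset + hyperparameters: one 100K-step DROID pretraining run takes 4×A100 for 24h, a budget that keeps reproduction within reach of academic labs.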

Weaknesses

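Scale is small. 2K DROID trajectories (out of 76K in DROID) + a DiT at the 180M-parameter level. This only shows the idea works at small scale. Whether "unified diffusion" still beats VLM-prior architectures at the scale of VLAs such as π₀ / RDT / OpenVLA is an open question. The paper has no direct comparison with a flow-matching VLA.
Forward-dynamics quality is the paper's weak spot. The authors acknowledge artifacts and report no quantitative FVD / PSNR, only qualitative frames. This suggests that downstream uses of "UWM as a world model for planning" (e.g. MPC) are unlikely to work as is. What really works in UWM is the policy; the world-modeling branch acts more like auxiliary supervision and should be positioned as such.
The ResNet-18 + frozen SDXL VAE combination feels very 2023. Now that SOTA video / world models use DiTs with large video decoders, the ResNet-18 observation encoder is a bottleneck and also limits video prediction quality. The paper does not explore this in depth.
Inverse-dynamics evaluation is weak. It only covers tracking "given GT future obs", with no comparison against the more practical goal-conditioned setting of "given a high-level visual goal image". As a standalone contribution, inverse dynamics needs a more serious benchmark.
Cross-embodiment video gets only a shallow check with Kinetics / SSv2; human video adds just +0.02 to +0.04 (Table IX). The most tempting promise, "scaling robot learning with internet video", is therefore not yet delivered.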

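Credibility Assessment

Artifact Availability

Code: inference + training (https://github.com/WEIRDLabUW/unified-world-model, PyTorch implementation)
Model weights: not stated (the README does not mention released checkpoints; the repository is mainly training scripts)
Training details: fully disclosed. Hyperparameters (Table V: embed_dim=768, depth=12, heads=12, registers=8, 10-step DDIM inference, lr=1e-4, bs=36×4) + training steps (100K pretraining; 10K/20K/50K finetuning depending on task) + compute (4×A100, 24h) + data mix (2K robot + 2K video, uniformly mixed batches)
Datasets: DROID is open (https://arxiv.org/abs/2403.12945) and LIBERO is open; the finetuning data for the 5 real-robot tasks (50–150 demos/task) is private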

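Claim Verifiability

✅ UWM beats DP/PAD/GR1 on both ID and OOD for DROID → 5 real-robot tasks: Table I + Fig 6, 50 initializations per task, with an overhead camera tracking the initializations (a rigorous real-robot eval protocol)
✅ Video cotraining consistently improves UWM on 5/5 tasks: directly readable from Table I, and confirmed again by the categorized OOD in Table IV (21/30 vs 15/30)
✅ Future obs reconstruction is clearly better than current obs reconstruction: Table VIII shows the gain comes from dynamics rather than a pure auxiliary loss
✅ Registers help: Table VII improves monotonically across 0 / 4 / 8 registers
⚠️ "Separate timesteps give model causal understanding between actions and observations": the "causal understanding" claim is supported only by end-task performance, with no independent probe (e.g. a consistency check on counterfactual action → obs predictions). It is a plausible interpretation, not a verified mechanism
⚠️ Inverse dynamics > policy on tracking (Table III): holds, but only on 2 LIBERO tasks with a small sample; the metric is success rate rather than action MSE, so task-level noise may dominate
⚠️ "Unified model benefits from feature sharing between actions and pixels": the register ablation supports "cross-modal exchange needs an intermediary", but "feature sharing → better policy" remains an indirect inference
No ❌ marketing claims. The authors' claims are all concrete, checkable performance numbers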

Notes

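❓ Why not a flow matching version directly? How does the "timestep ≡ mask" equivalence carry over to flow matching? If FM's intermediate state $x_{t}$ does not satisfy the clean boundary "$t = T$ means pure Gaussian", UWM's insight would need to be re-derived. A UWM-FM variant is worth trying
❓ What happens at VLA scale? UWM is at the 180M-parameter level (DiT-B). Swap the backbone for a 2B-7B VLM (reusing PaLI / Gemma pretraining), keep the existing head for the action branch, and use a Cosmos-class decoder for the video branch: could such a "UWM × VLA" combination beat pure VLAs like π₀? Worth a follow-up
❓ Mechanistic evidence for causal understanding? One could design a counterfactual probe: fix $o$, vary $a$, and test whether the pixel changes in the forward-dynamics output correspond to physically consistent robot motion. This would upgrade the "causal" claim from inference to evidence
❓ Generalizing to per-token timesteps? Currently only the two modality blocks, action and obs, have independent timesteps. Could this go further to a per-action-step timestep? For example, predicting step 5 of an action chunk with the first 4 steps clean and the last 11 noisy naturally gives "partial rollout inverse dynamics". This would further generalize the training recipe for action chunking
In dialogue with the world-model line of work: UWM explicitly demotes world modeling to a "pretraining auxiliary for policy", rather than having the world model serve planning on its own as in Cosmos / DreamerV3. This is effectively a vote in the "world model + planning vs amortized policy" debate, and the paper leans toward the latter
The data-efficiency subplot: 2K real-robot trajectories + 2K videos already yield a 20% absolute lift; what would scaling to the full 76K DROID (roughly 40×) give? The from-scratch vs pretrain gap (Fig 10) hints that scaling gains are still at an early stage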
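
The per-action-step idea raised in the notes above can be made concrete with a small sketch. This is speculative and not part of UWM: `per_step_timesteps` is a hypothetical helper that assigns one timestep per position in a 16-step action chunk.

```python
T = 100  # training diffusion steps


def per_step_timesteps(horizon, num_clean, t_current):
    """Timestep vector: clean prefix, one step being denoised, fully noisy suffix."""
    if not 0 <= num_clean < horizon:
        raise ValueError("num_clean must leave at least one step to predict")
    if not 0 <= t_current <= T:
        raise ValueError("t_current must lie in [0, T]")
    return [0] * num_clean + [t_current] + [T] * (horizon - num_clean - 1)


# Predicting step 5 of a 16-step chunk: 4 clean, 1 in progress, 11 noisy.
ts = per_step_timesteps(horizon=16, num_clean=4, t_current=50)
```

Training with such vectors would require sampling a timestep per action step instead of one per modality, which is the same move UWM makes between modalities, applied one level down.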

Rating

Metrics (as of 2026-04-24): citation=82, influential=8 (9.8%), velocity=6.46/mo · 12.7mo old; HF upvotes=4; github 212⭐ / forks=13 / 90d commits=0 / pushed 196d ago · stale

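Score: 2 - Frontier

Rationale: downgraded in the 2026-04 review. "Diffusion timestep ≡ soft mask" is indeed an elegant reframing for unified video-action diffusion (see the simple/scalable/generalizable judgment in Strengths, the separation of PAD's design choices, and Table VIII's rebuttal of the "pure aux loss" explanation), and RSS 2025 acceptance + DROID real-robot validation also support the Frontier tier. Grounds for dropping to tier 2: metrics show weak community adoption, with 212⭐ / 12.6mo (20–50× fewer stars than OpenVLA at 5959⭐/22mo and Pi0 at 11456⭐/18mo from the same period), HF upvotes=4, a stale repo (pushed 196d ago, 0 commits in 90d), and, with citation data unavailable for now, no evidence of being "widely built on". The Frontier tier's definition, "representative of a method paradigm but not yet settled into a standard", matches the current signals; if later Diffusion Forcing / unified-diffusion-style work explicitly takes UWM as a reference, the rating can move back up to 3.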
