π0: A Vision-Language-Action Flow Model for General Robot Control

Summary

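Core idea: the first VLA to combine a pre-trained VLM with flow matching; pre-trained on more than 10,000 hours of multi-robot data, it generalizes across dexterous manipulation tasks to a degree not shown before

Method: PaliGemma backbone + action expert (MoE-style) + flow matching action generation + pre-training/post-training recipe

Results: substantially outperforms OpenVLA and Octo in both zero-shot and fine-tuning evaluations; first demonstration of long-horizon dexterous tasks such as laundry folding and cardboard box assembly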

Sources: paper | website | github

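Rating: 3 - Foundation (the de facto founding work for flow-matching VLAs and a must-cite baseline for subsequent mainstream VLA work; openpi has become a community reference implementation)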

Key Takeaways:

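Flow matching + VLM architecture: adds an action expert and flow matching on top of a pre-trained VLM (PaliGemma) and outputs a continuous action distribution for control at up to 50 Hz, combining semantic understanding with dexterous control
Pre-training / post-training recipe: a two-stage procedure similar to LLM training. Large-scale, diverse pre-training supplies broad physical knowledge and the ability to recover from mistakes; high-quality post-training data teaches the model to execute fluently
Cross-embodiment scale: unified pre-training on 7 robot configurations, 68 tasks, and 10,000+ hours of data; substantially outperforms OpenVLA and Octo on both zero-shot and fine-tuning tasks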

Teaser. π0 controls a mobile manipulator to fold laundry — pre-trained on data from 7 robot configurations and 68 tasks.

Introduction

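Robot learning faces three major bottlenecks: data, generalization, and robustness. By analogy with foundation models in NLP and CV, the authors argue that a generalist robot policy (that is, a robot foundation model) can address these through pre-training on large-scale, diverse data, in the same way that a bird recognizer is best built by first pre-training on large-scale image-language data and then fine-tuning.

Core contributions:

Architecture: a new generalist robot policy built on VLM pre-training and flow matching
Training recipe: two stages, pre-training and post-training. The intuition is that a model trained only on high-quality data never learns to recover from mistakes, while a model trained only on lower-quality pre-training data never learns to execute efficiently; combining the two yields the desired behavior
Evaluation: zero-shot control, language instruction following, and fine-tuning to downstream tasks, covering difficult tasks such as laundry folding, table bussing, and box assembly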

Overview

Figure 3. Framework overview: pre-training mixture → flow matching VLA with VLM backbone + action expert, initialized from PaliGemma.

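Training proceeds in two stages:

Pre-training: the authors' own dexterous manipulation dataset (7 robot configurations, 68 tasks) plus the open-source OXE dataset (22 robots); language labels combine task names with segment annotations (fine-grained labels for sub-trajectories of roughly 2 s)
Post-training: fine-tuning on high-quality task-specific data, from about 5 hours for the simplest tasks to 100+ hours for the most complex

The model builds on the PaliGemma VLM and adds flow matching action outputs to form a VLA. PaliGemma was chosen because its 3B parameter scale balances performance against real-time control requirements.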

The π0 Model

Core design of π0:

VLM backbone (PaliGemma, 3B params): a standard late-fusion VLM; image encoders embed RGB images into the same embedding space as language tokens
Action expert (~300M params): a separate set of transformer weights that processes robot state and action tokens and interacts with the VLM backbone through self-attention (analogous to a 2-expert MoE)
Flow matching: models the continuous action distribution $p(A_t \mid o_t)$, where the action chunk is $A_t = [a_t, \ldots, a_{t+H-1}]$ with $H = 50$

The observation $o_t = [I_t^1, \ldots, I_t^n, \ell_t, q_t]$ consists of multiple RGB images, a language command, and the proprioceptive state.

Equation 1. Conditional flow matching loss

$$L^{\tau}(\theta) = \mathbb{E}_{p(A_t \mid o_t),\, q(A_t^{\tau} \mid A_t)} \left\| v_{\theta}(A_t^{\tau}, o_t) - u(A_t^{\tau} \mid A_t) \right\|^{2}$$

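Notation: subscripts index robot timesteps; the superscript $\tau \in [0, 1]$ indexes flow matching timesteps. The probability path is linear-Gaussian, $q(A_t^{\tau} \mid A_t) = \mathcal{N}(\tau A_t, (1 - \tau) I)$. Meaning: during training, sample random noise $\epsilon \sim \mathcal{N}(0, I)$, compute the noisy actions $A_t^{\tau} = \tau A_t + (1 - \tau) \epsilon$, and train the network output $v_{\theta}(A_t^{\tau}, o_t)$ to match the denoising vector field $u(A_t^{\tau} \mid A_t) = \epsilon - A_t$. Note that the $\tau$-derivative of this interpolation is $A_t - \epsilon$, so the sign of $u$ as written has to be paired with a matching integration direction at inference.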
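To make the training target concrete, here is a minimal pure-Python sketch of one loss evaluation for a single action chunk. It builds the noisy actions by linear interpolation and regresses onto $u = \epsilon - A_t$ exactly as quoted in these notes. The function names, the `predict(a_tau, obs, tau)` signature, and the zero-output stand-in network are illustrative assumptions, not the authors' code or the openpi API.

```python
import random


def noisy_actions(actions, noise, tau):
    """Return A^tau = tau * A + (1 - tau) * eps for an H x D action chunk."""
    return [
        [tau * a + (1.0 - tau) * e for a, e in zip(a_row, e_row)]
        for a_row, e_row in zip(actions, noise)
    ]


def target_field(actions, noise):
    """Return the regression target u = eps - A, as quoted in these notes."""
    return [
        [e - a for a, e in zip(a_row, e_row)]
        for a_row, e_row in zip(actions, noise)
    ]


def flow_matching_loss(predict, actions, obs, tau, rng):
    """Squared error between predict(A^tau, obs, tau) and u for one chunk."""
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in actions]
    a_tau = noisy_actions(actions, noise, tau)
    u = target_field(actions, noise)
    v = predict(a_tau, obs, tau)
    return sum(
        (vi - ui) ** 2
        for v_row, u_row in zip(v, u)
        for vi, ui in zip(v_row, u_row)
    )


# Toy usage: H = 50 steps, 18-dim actions, and a stand-in "network" that
# always predicts zero (a placeholder, not a real model).
rng = random.Random(0)
chunk = [[0.0] * 18 for _ in range(50)]
zero_net = lambda a_tau, obs, tau: [[0.0] * len(r) for r in a_tau]
loss = flow_matching_loss(zero_net, chunk, obs=None, tau=0.3, rng=rng)
```

In a real training loop the loss would be averaged over a batch and over sampled values of `tau`; the sketch fixes `tau` only to stay short.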

Equation 2. Forward Euler integration (inference)

$$A_t^{\tau + \delta} = A_t^{\tau} + \delta\, v_{\theta}(A_t^{\tau}, o_t)$$

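Meaning: at inference, integrate from $\tau = 0$ (pure noise, $A_t^{0} \sim \mathcal{N}(0, I)$) to $\tau = 1$ in 10 steps ($\delta = 0.1$). The keys and values of the observation prefix can be cached and reused across integration steps, so only the action tokens are recomputed at each step.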
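The integration rule can be sketched in a few lines of pure Python. Because no trained network is available here, the example uses a hypothetical toy vector field that points from the current sample toward a known target chunk; `euler_sample` and `make_toy_field` are illustrative names, not openpi functions. Along the interpolation path this toy field equals $A_t - \epsilon$, which is the negative of the target $u = \epsilon - A_t$ quoted above, so treat the sign used here as an assumption of the sketch rather than a statement about the paper or its code.

```python
import random


def euler_sample(vector_field, obs, horizon, action_dim, num_steps=10, rng=None):
    """Integrate x <- x + delta * v(x, obs, tau) from tau = 0 up to tau = 1."""
    rng = rng or random.Random(0)
    delta = 1.0 / num_steps
    # Start from Gaussian noise, one row per future timestep in the chunk.
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    for i in range(num_steps):
        tau = i / num_steps
        v = vector_field(x, obs, tau)
        x = [
            [xi + delta * vi for xi, vi in zip(x_row, v_row)]
            for x_row, v_row in zip(x, v)
        ]
    return x


def make_toy_field(target):
    """Toy stand-in for v_theta: velocity that reaches `target` at tau = 1."""
    def field(x, obs, tau):
        return [
            [(t - xi) / (1.0 - tau) for xi, t in zip(x_row, t_row)]
            for x_row, t_row in zip(x, target)
        ]
    return field


# Toy usage: a 50-step chunk of 18-dim actions with an arbitrary target.
target = [[0.01 * (h + d) for d in range(18)] for h in range(50)]
sample = euler_sample(make_toy_field(target), obs=None, horizon=50, action_dim=18)
```

With a learned network in place of the toy field, the observation would be encoded once and only the action tokens recomputed inside the loop, which is what the KV cache remark above refers to.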

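Key architectural details:

Attention mask: blockwise causal over three blocks: (1) images + language, (2) proprioceptive state, (3) noisy actions. Attention is bidirectional within a block and causal across blocks. Rationale: the image-language block never attends to the new inputs, which keeps it close to the VLM's pre-training distribution, and the keys and values of the state block can be cached
Downsized action expert: width=1024, mlp_dim=4096 (the VLM backbone uses width=2048, mlp_dim=16384), which speeds up the multiple forward passes needed at inference
Flow matching timestep sampling: a shifted Beta distribution $p(\tau) = \mathrm{Beta}(\frac{s - \tau}{s}; 1.5, 1)$ that emphasizes low timesteps (high noise). The reason is that action prediction differs from image synthesis: the observation constrains the action far more tightly than a text label constrains an image
π0-small: a 470M-parameter ablation model without VLM initialization; it encodes language with DistilBERT and uses a DiT-style action expert in an encoder-decoder arrangement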
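Two of the details above are easy to illustrate in code: the blockwise causal mask and the shifted Beta timestep sampler. The sketch below is pure Python with hypothetical function names and toy block sizes. The cutoff `s = 0.999` is an assumption (these notes do not state its value), and the sampler simply draws $x \sim \mathrm{Beta}(1.5, 1)$ and sets $\tau = s(1 - x)$, which follows from the density given above.

```python
import random


def blockwise_causal_mask(block_sizes):
    """mask[q][k] is True when query token q may attend to key token k.

    Tokens attend to every token in their own block and in earlier blocks.
    """
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    return [[bk <= bq for bk in block_of] for bq in block_of]


def sample_flow_timestep(rng, s=0.999):
    """Draw tau with (s - tau) / s ~ Beta(1.5, 1), so small tau is favored."""
    x = rng.betavariate(1.5, 1.0)
    return s * (1.0 - x)


# Toy sizes (assumed): 4 image/language tokens, 1 state token, 3 action tokens.
mask = blockwise_causal_mask([4, 1, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Printing the mask shows the structure directly: the first block only sees itself, the state token sees the first block plus itself, and the action tokens see everything.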

Table 1. Inference time on NVIDIA RTX 4090

Model Part	Inference Time
Image encoders	14 ms
Observation forward pass	32 ms
x10 action forward pass (flow)	27 ms
Network latency (if off-board)	13 ms
Total on-board inference	73 ms
Total off-board inference	86 ms
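As a quick consistency check on Table 1, the totals can be recomputed from the component rows and compared with the 0.5 s re-inference interval quoted for 50 Hz robots. The dictionary keys and helper names below are illustrative; the numbers are copied from the table.

```python
# Per-component latencies in milliseconds, copied from Table 1.
ON_BOARD_MS = {
    "image_encoders": 14,
    "observation_forward_pass": 32,
    "action_forward_pass_x10": 27,
}
NETWORK_MS = 13  # only paid when inference runs off-board


def total_latency_ms(off_board):
    """Sum the component latencies, adding network time if off-board."""
    total = sum(ON_BOARD_MS.values())
    return total + NETWORK_MS if off_board else total


def share_of_interval(latency_ms, interval_s):
    """Fraction of a re-inference interval spent waiting for the model."""
    return latency_ms / (interval_s * 1000.0)


print(total_latency_ms(off_board=False))  # 73
print(total_latency_ms(off_board=True))   # 86
print(share_of_interval(total_latency_ms(off_board=False), 0.5))
```

The ten flow matching steps together cost 27 ms, or about 2.7 ms per step, which is less than the single observation forward pass.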

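Action chunks are executed open-loop: 50 Hz robots run inference every 0.5 s (after executing 25 actions), and 20 Hz robots every 0.8 s (after executing 16 actions). Temporal ensembling is not used, since experiments showed that it hurts performance.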
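The execution schedule can be sketched as a simple loop: predict a chunk, execute only its first few actions, then predict again. The callables `policy`, `get_observation`, and `send_action` are hypothetical placeholders for a real robot stack, and the sketch leaves out real-time pacing (sleeping between actions) to stay self-contained.

```python
def steps_between_inferences(control_hz, replan_interval_s):
    """Number of actions executed before the policy is queried again."""
    return round(control_hz * replan_interval_s)


def run_open_loop(policy, get_observation, send_action,
                  control_hz, replan_interval_s, total_steps):
    """Execute the first k actions of each predicted chunk, then re-infer."""
    k = steps_between_inferences(control_hz, replan_interval_s)
    executed = 0
    num_inferences = 0
    while executed < total_steps:
        chunk = policy(get_observation())
        num_inferences += 1
        for action in chunk[:k]:
            if executed >= total_steps:
                break
            send_action(action)
            executed += 1
    return num_inferences


# Toy usage: a fake policy whose 50-step chunk is just the indices 0..49.
sent = []
n = run_open_loop(
    policy=lambda obs: list(range(50)),
    get_observation=lambda: None,
    send_action=sent.append,
    control_hz=50,
    replan_interval_s=0.5,
    total_steps=100,
)
```

In this toy run only the first 25 entries of every 50-step chunk are ever sent, which mirrors the 50 Hz schedule described above.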

Data Collection and Training Recipe

Pre-training and post-training

Figure 4. Dataset overview: pre-training mixture weights and relative sizes.

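Data composition:

9.1%: open-source data (OXE, Bridge v2, DROID), covering a wide range of objects and environments with low-frequency control (2-10 Hz)
90.9%: the authors' own data, 903M timesteps in total (106M single-arm, 797M dual-arm) across 68 tasks
Each task-robot combination is weighted by $n^{0.43}$, where $n$ is its number of samples, to down-weight over-represented combinations
Configuration and action vectors are unified to the largest dimensionality, 18 (two 6-DoF arms + 2 grippers + mobile base + torso); lower-dimensional robots are zero-padded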
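The weighting and padding rules above can be sketched as follows. The sample counts are made up for illustration, the helper names are hypothetical, and normalizing the weights to sum to 1 is an assumption of this sketch (the notes only give the $n^{0.43}$ form).

```python
def mixture_weights(sample_counts, exponent=0.43):
    """Weight each task-robot combination by n ** exponent, then normalize."""
    raw = {name: n ** exponent for name, n in sample_counts.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def pad_to_max_dim(vector, max_dim=18):
    """Zero-pad a lower-dimensional state or action vector to max_dim."""
    if len(vector) > max_dim:
        raise ValueError("vector is longer than the unified dimension")
    return list(vector) + [0.0] * (max_dim - len(vector))


# Hypothetical counts: one combination has 100x more samples than the other.
counts = {"bimanual/fold_shirt": 1000, "single_arm/stack_bowls": 10}
weights = mixture_weights(counts)
ratio = weights["bimanual/fold_shirt"] / weights["single_arm/stack_bowls"]

# A 7-dim single-arm action becomes an 18-dim vector with trailing zeros.
padded = pad_to_max_dim([0.1] * 7)
```

With an exponent below 1, a combination with 100 times more data is sampled only about 7 times more often (`ratio` is `100 ** 0.43`), which is the down-weighting effect described above.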

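Post-training uses high-quality task-specific data, starting at about 5 hours for simple tasks and reaching 100+ hours for complex ones. Intuition: diverse but lower-quality pre-training data lets the model recover from mistakes, while high-quality post-training data teaches it to execute fluently and efficiently.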

Language and high-level policies

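For complex tasks that require semantic reasoning (such as table bussing), a high-level VLM decomposes the high-level command into intermediate subtask instructions in natural language (similar to SayCan), and π0 serves as the low-level policy that executes them.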
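One way to picture this two-level setup is a loop in which a high-level policy picks the next language subtask and a low-level policy turns it into an action chunk. Everything in the sketch below is hypothetical: the callable interfaces, the use of `None` as a completion signal, and the choice to query the high-level policy once per chunk are assumptions, since these notes do not specify how the two levels are scheduled.

```python
def run_hierarchical(task, high_level_policy, low_level_policy,
                     get_observation, execute_chunk, max_subtasks=20):
    """Alternate between subtask selection and low-level execution."""
    history = []
    for _ in range(max_subtasks):
        obs = get_observation()
        subtask = high_level_policy(task, obs, history)
        if subtask is None:  # high-level policy signals completion
            break
        execute_chunk(low_level_policy(obs, subtask))
        history.append(subtask)
    return history


# Toy usage with a scripted high-level policy and a fake low-level policy.
script = ["pick up the plate", "put the plate in the bin"]
executed_chunks = []
done = run_hierarchical(
    task="bus the table",
    high_level_policy=lambda task, obs, hist: (
        script[len(hist)] if len(hist) < len(script) else None
    ),
    low_level_policy=lambda obs, subtask: [subtask] * 3,
    get_observation=lambda: {},
    execute_chunk=executed_chunks.append,
)
```

The low-level policy only ever sees the current observation and one short instruction, which is why better language following in the low-level model matters for this arrangement.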

Robot system details

Figure 5. The 7 robot configurations used for training.

Seven configurations: single-arm UR5e (7D), bimanual UR5e (14D), Franka (8D), bimanual Trossen in an ALOHA-style setup (14D), bimanual ARX/AgileX (14D), mobile Trossen/ARX (16D action), and mobile Fibocom with a holonomic base (17D action).

Experimental Evaluation

Evaluating the base model

Figure 6. Zero-shot evaluation tasks.

Figure 7. Zero-shot evaluation results.

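Five zero-shot tasks (shirt folding, bussing easy, bussing hard, grocery bagging, toast out of toaster), compared against OpenVLA (7B), Octo (93M), and π0-small (470M). Results:

π0 leads by a wide margin on every task (near-perfect on shirt folding and bussing easy)
Even the compute-parity version (160k training steps instead of 700k) outperforms all baselines
π0-small > OpenVLA > Octo, which points to the architectural advantage of flow matching + action chunking
OpenVLA struggles because its autoregressive discretization does not support action chunks; Octo supports diffusion outputs but has limited representational capacity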

Following language commands

Figure 8. Language-conditioned evaluation tasks.

Figure 9. Language evaluation results.

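Three tasks (bussing, table setting, grocery bagging) under five model-instruction conditions (π0-flat, π0-human, π0-HL, π0-small-flat, π0-small-human), built from three instruction modes:

flat: only the overall task description is given
human: a human supplies intermediate step-by-step instructions
HL: a high-level VLM supplies the intermediate instructions automatically

π0 follows language instructions significantly more accurately than π0-small, so VLM pre-training translates directly into better instruction following. π0-human > π0-HL > π0-flat, which shows that intermediate language instructions do help. π0-small gains little even from human guidance because its language understanding is weak.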

Learning new dexterous tasks

Figure 10. Fine-tuning evaluation tasks across difficulty tiers.

Figure 11. Fine-tuning results with varying data amounts.

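Five new tasks in three difficulty tiers:

Easy (similar to pre-training): stack bowls, towel folding
Medium (some new elements): tupperware in microwave
Hard (entirely new): paper towel replacement, Franka items in drawer

Compared against ACT, Diffusion Policy, OpenVLA, and Octo. π0 is consistently the best, and pre-training helps most when data is scarce (with 1 h of data on the tupperware task it clearly outperforms the baselines). Interestingly, among prior methods those trained from scratch (ACT, Diffusion Policy) beat those fine-tuned from pre-trained checkpoints (OpenVLA, Octo), which suggests that making effective use of pre-training is the main challenge for these earlier methods.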

Mastering complex multi-stage tasks

Figure 12. Complex multi-stage tasks: laundry, bussing, box building, egg packing, to-go box.

Figure 13. Post-training results on complex tasks.

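Seven difficult long-horizon tasks (5-20 minutes each): laundry folding (folding multiple garments that start in a randomly crumpled state), mobile laundry, mobile dryer unloading, table bussing (with 12 novel objects), box building, egg packing, and to-go box packing.

Three variants are compared: full pre-training + fine-tuning, zero-shot (pre-training only), and training from scratch. Results:

The full recipe (pre-training + post-training) is best on every task
Pre-training helps more on harder tasks (box building, egg packing, and similar tasks are nearly unsolvable when training from scratch)
According to the authors, no prior method solves these tasks, so the results are presented as a new state of the art for dexterous manipulation with learned policies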

Video. π0 autonomously unloads dryer and folds clothes (single policy, uncut).

Discussion, Limitations, and Future Work

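Limitations and future directions:

How the pre-training mixture should be composed is not well understood; the current approach simply combines all available data
Not every task works reliably, and predicting how much data and what kind of data a task needs remains an open question
Positive transfer across more distant domains (autonomous driving, navigation, legged locomotion) has yet to be demonstrated
Future directions: long-horizon reasoning and planning, autonomous self-improvement, robustness, safety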

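Related Work

Builds on

PaliGemma: a 3B-parameter VLM used to initialize the π0 backbone
Transfusion: trains a single transformer with multiple objectives, a flow matching loss on continuous tokens and a cross-entropy loss on discrete tokens
Flow Matching / Conditional Flow Matching: generates the continuous action distribution using a linear-Gaussian probability path
Open X-Embodiment (OXE): an open-source cross-embodiment dataset covering 22 robots, included in the pre-training mixture
Mobile ALOHA: a bimanual mobile manipulation platform on which some of π0's robot configurations are based

Compared against

OpenVLA: a 7B VLA with autoregressive discretized actions and no action chunk support; performs poorly on high-frequency dexterous tasks
Octo: 93M parameters; supports diffusion action outputs but has limited representational capacity
ACT: Action Chunking with Transformers, a dexterous manipulation baseline trained from scratch
Diffusion Policy: diffusion-based visuomotor policy learning

Related methods

Action Chunking: predicts a sequence of the next H actions instead of a single step; π0 uses H=50
SayCan: high-level planning that decomposes a task into subgoal language instructions; π0 similarly uses a high-level VLM policy
Mixture of Experts: the action expert design resembles a 2-expert MoE in which different tokens are routed to different weights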

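Review

Strengths

Well-designed architecture: the MoE-style design, in which the action expert interacts with the VLM backbone through self-attention, preserves the pre-trained VLM weights while allowing flexible action-specific representations; the blockwise causal attention mask and KV caching strategy are clean engineering
Pre-training / post-training recipe: an explicit analogy to the LLM training paradigm that gives the robotics community a clear methodology: diverse data for robustness, high-quality data for fluency
Unusually difficult evaluation tasks: long-horizon dexterous tasks such as folding multiple garments, assembling cardboard boxes, and packing eggs go far beyond the “pick up the cup” level of earlier VLA evaluations
Scale: 10,000+ hours of data and 7 robot configurations; the scale itself is an important contribution

Weaknesses

Data is not reproducible: 90.9% of the data is private, and the main advantage lies in the data rather than in the architecture alone, so academic groups cannot fully reproduce the work
Limited fairness of comparisons: OpenVLA and Octo are at an inherent disadvantage when trained on the authors' high-frequency, complex data mixture (OpenVLA does not support action chunks), and π0-small is both smaller and lacks VLM initialization, which makes it hard to separate the contribution of each factor
Limited generalization scope: all tasks are tabletop or indoor manipulation; broader embodied settings (outdoor, dynamic environments, multi-agent) are not explored
No scaling laws: the relationship between performance and data volume, model size, or compute is not analyzed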

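Credibility Assessment

Artifact availability

Code: inference + training (the openpi repo provides complete training and inference code, with JAX and PyTorch support)
Model weights: π0 base, π0-FAST base, π0-FAST-DROID, π0-DROID, π0-ALOHA-towel, π0-ALOHA-tupperware, π0-ALOHA-pen-uncap (all released through a GCS bucket)
Training details: high-level description only. The paper mentions 700k steps, $n^{0.43}$ weighting, Beta timestep sampling, and similar items, but does not fully disclose hyperparameters or the training configuration
Datasets: partially public. OXE, Bridge v2, and DROID are open source; the authors' own dexterous manipulation data (903M timesteps) is private

Claim verifiability

✅ Zero-shot performance exceeds OpenVLA/Octo: the paper provides detailed evaluation rubrics, averages over 10 episodes, and comparisons across multiple tasks, and the open-source models allow part of the experiments to be reproduced
✅ Flow matching outperforms autoregressive discretization for dexterous control: the architectural analysis is sound, and OpenVLA's failure modes (no action chunk support, difficulty with high-frequency control) are clearly explained
⚠️ “Most capable and dexterous generalist robot policy to date”: the tasks are indeed much harder than in prior work, but there is no direct comparison with ACT/Diffusion Policy on the complex multi-stage tasks (they are compared only on the simpler fine-tuning tasks)
⚠️ Independent contribution of pre-training: the pre-training vs. from-scratch comparison is convincing on complex tasks, but the effect of the pre-training data composition on the results is not analyzed in depth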

Notes

Rating

Metrics (as of 2026-04-24): citation=1480, influential=286 (19.3%), velocity=83.62/mo; HF upvotes=31; github 11498⭐ / forks=1828 / 90d commits=22 / pushed 8d ago

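Score: 3 - Foundation. Rationale: π0 has become the founding work of the flow-matching VLA paradigm. openpi is one of the most widely adopted open-source codebases in the VLA area (more than 11,000 GitHub stars per the metrics above, with continued community contributions), and subsequent mainstream VLA work (π0.5, GR00T N1, SmolVLA, OpenVLA-OFT, and others) either cites it as a required baseline or directly reuses its architecture. Compared with 2 - Frontier, it is not “one of the latest SOTA results” but a settled reference point for the direction; compared with older work that has been superseded, its architecture (VLM + action expert MoE + flow matching) is still being actively reused and extended, which makes it required reading for the field.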
