RT-H: Action Hierarchies Using Language

Summary

Core idea: use language motions (e.g. “move arm forward”) as an intermediate layer between the high-level task and low-level actions, forming an action hierarchy.

Method: built on RT-2 (PaLI-X 55B VLM) with a two-stage query: first predict a language motion, then predict the action from the language motion, task, and observation. Language motions are labeled automatically from proprioception.

Results: RT-H outperforms RT-2 by 15% on the Diverse+Kitchen dataset; language motion correction outperforms teleop correction (IWR) by 50%.

Sources: paper | website

Rating: 2 - Frontier (a representative work of the language-as-intermediate-action paradigm, cited by later hierarchical VLA work, but not a de facto standard)

Key Takeaways:

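Language motions as an intermediate layer: fine-grained language (e.g. “move arm forward”, “close gripper”) describes low-level motion and bridges the high-level task and the low-level action. Even tasks that differ greatly in semantics (e.g. “pick coke can” vs “pour cup”) can share data at the language motion level.
Automatic labeling: language motions are extracted automatically from robot proprioception with no human annotation; their combinations yield more than 2500 distinct language motions.
Language correction: language motions provide an intuitive correction interface. A human can specify a motion correction in language, which is easier to use and more sample efficient than teleoperation.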

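Teaser. Conceptual diagram of the RT-H action hierarchy: given a task and an image, the VLM first predicts language motions, then predicts actions conditioned on those language motions; a human can intervene through language motion corrections.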

Video 1. RT-H overview video

Introduction

Language-conditioned policies exploit the compositional structure of language to share data across semantically similar tasks. But when tasks become semantically diverse (e.g. “pick coke can” vs “pour cup”), the high-level task description alone makes it hard to learn shared structure.

RT-H's insight: teach the robot the language of actions. Finer-grained language motions describe low-level motions and serve as an intermediate prediction layer between the high-level task and the low-level action.

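An action hierarchy brings three benefits:

Better data sharing: different tasks overlap heavily at the language motion level (e.g. “pour cup” and “pick coke can” share language motions during the pick phase).
Contextual actions: language motions are learned in the context of the task and scene, so the concrete action for “move arm forward” depends on the current scene.
Language correction: a human can correct the policy directly in language, which is more intuitive and more sample efficient than teleoperation.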

RT-H: Action Hierarchies using Language

Formalizing Action Hierarchies

Given a dataset $D = \{(τ_{1}, g_{1}), \dots, (τ_{N}, g_{N})\}$, each demonstration $τ_{i}$ contains observations, intermediate actions $z$ (language motions), and low-level actions $a$. The goal is to learn the composition of two policies:

$$π(a, z \mid o, g) = π_{h}(z \mid o, g) \, π_{l}(a \mid o, g, z)$$

Notation: $π_{h}$ is the high-level policy (predicts the language motion $z$), $π_{l}$ is the low-level policy (predicts the action $a$ given the language motion), $o$ is the observation, and $g$ is the task description.
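As a minimal sketch of the factorization above, the snippet below composes two toy tabular distributions for one fixed (observation, task) pair. It checks that the product of the high-level and low-level terms is a proper joint distribution and samples the language motion first, then the action. The tables `PI_H` and `PI_L`, their labels, and their probabilities are made-up illustrations, not values from the paper.

```python
import random

# Hypothetical toy distributions for one fixed (observation, task) pair.
# PI_H[z] plays the role of pi_h(z | o, g); PI_L[z][a] plays pi_l(a | o, g, z).
PI_H = {"move arm forward": 0.7, "close gripper": 0.3}
PI_L = {
    "move arm forward": {"dx=+1": 0.9, "dx=0": 0.1},
    "close gripper": {"grip=1": 0.8, "grip=0": 0.2},
}


def joint(a, z):
    """pi(a, z | o, g) = pi_h(z | o, g) * pi_l(a | o, g, z)."""
    return PI_H[z] * PI_L[z].get(a, 0.0)


def sample(rng):
    """Sample the language motion first, then the action conditioned on it."""
    z = rng.choices(list(PI_H), weights=list(PI_H.values()))[0]
    a = rng.choices(list(PI_L[z]), weights=list(PI_L[z].values()))[0]
    return z, a


# The joint sums to 1 because each conditional table sums to 1.
total = sum(joint(a, z) for z in PI_L for a in PI_L[z])
z, a = sample(random.Random(0))
print(f"sum of joint = {total:.3f}, sample = ({z!r}, {a!r})")
```

In RT-H both factors come from the same VLM rather than from lookup tables; the sketch only illustrates the order of conditioning.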

RT-H: Model and Training Details

RT-H uses the same PaLI-X 55B architecture as RT-2 (ViT encoder + Encoder-Decoder Transformer). A single model learns two queries:

Language motion query ( $π_{h}$ ): takes the task $g$ and image $o$ as input and predicts the language motion $z$. Prompt: “What skill should the robot do to [task]?”
Action query ( $π_{l}$ ): takes the task $g$, image $o$, and language motion $z$ (fed through the Encoder) as input and predicts the tokenized action. Prompt: “What action should the robot do to [task], with current skill: [motion]?”

Key design choice: both queries use the same VLM, and in the action query the language motion $z$ enters through the Encoder (not the Decoder), so it acts as additional context that guides action prediction.
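The two prompts above can be sketched as plain string templates issued to one shared model. The template text is quoted from these notes; the function names, the `vlm` callable signature, and the stub model are hypothetical and stand in for the real PaLI-X interface, which is not public.

```python
from typing import Callable, Tuple

# Prompt templates as quoted in the notes; everything else is illustrative.
MOTION_PROMPT = "What skill should the robot do to {task}?"
ACTION_PROMPT = (
    "What action should the robot do to {task}, with current skill: {motion}?"
)


def language_motion_query(task: str) -> str:
    return MOTION_PROMPT.format(task=task)


def action_query(task: str, motion: str) -> str:
    return ACTION_PROMPT.format(task=task, motion=motion)


def two_stage_step(
    vlm: Callable[[str, object], str], task: str, image: object
) -> Tuple[str, str]:
    """Query the same model twice: first for the motion, then for the action."""
    motion = vlm(language_motion_query(task), image)
    action = vlm(action_query(task, motion), image)
    return motion, action


def stub_vlm(prompt: str, image: object) -> str:
    """Stand-in for the VLM: returns canned text so the sketch is runnable."""
    if prompt.startswith("What skill"):
        return "move arm forward"
    return "<action tokens>"


print(language_motion_query("pick coke can"))
# prints: What skill should the robot do to pick coke can?
print(two_stage_step(stub_vlm, "pick coke can", image=None))
```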

Figure 2. RT-H architecture overview

Training recipe: 50% pre-training mixture + 25% language motion query + 25% action query. The ViT encoder is frozen; the learning rate is 1e-3.
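One way to read the 50/25/25 recipe is as per-example sampling probabilities over three data sources. The sketch below draws source labels with those weights; the source names and the sampling scheme are assumptions for illustration, since the notes do not say how the mixture is implemented.

```python
import random
from collections import Counter

# Mixture ratios from the training recipe; the source names are hypothetical labels.
MIXTURE = {
    "pretraining": 0.50,
    "language_motion_query": 0.25,
    "action_query": 0.25,
}


def sample_sources(n, rng):
    """Draw n source labels with probability proportional to the mixture weights."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(10_000, random.Random(0)))
for name in MIXTURE:
    print(f"{name}: {counts[name] / 10_000:.3f}")
```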

Extracting Language Motions

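Automatic labeling pipeline (no human annotation needed):

Map each dimension of the end effector pose change to a spatial direction (e.g. z-axis → up/down).
Determine the currently dominant spatial movements, filtering out dimensions below a threshold.
Order the remaining movements by action magnitude and combine them (e.g. “move arm forward and close gripper”).

Combinations over the 9 action dimensions (3 delta position + 3 delta orientation + 2 base movement + 1 gripper) yield more than 2500 language motions. Key trade-off: the finer-grained the language motions, the more they help the action query, but the harder they are for the language motion query to predict.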
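The three labeling steps can be sketched as a small function over a per-step delta. This is an illustration only: the axis sign conventions, the single shared threshold, the `"stop"` fallback label, and the restriction to four of the nine dimensions are assumptions, not the paper's actual rules.

```python
# Hypothetical mapping: dimension -> (phrase if positive, phrase if negative).
DIRECTIONS = {
    "x": ("move arm forward", "move arm backward"),
    "y": ("move arm left", "move arm right"),
    "z": ("move arm up", "move arm down"),
    "gripper": ("close gripper", "open gripper"),
}
THRESHOLD = 0.05  # assumed value; real thresholds would be tuned per dimension


def label_motion(delta, threshold=THRESHOLD):
    """Turn a dict of per-dimension deltas into a language motion string."""
    # Step 2: keep only the dominant dimensions (magnitude at or above threshold).
    active = [(abs(v), dim, v) for dim, v in delta.items() if abs(v) >= threshold]
    if not active:
        return "stop"  # assumed label when nothing moves
    # Step 3: order by magnitude, largest first.
    active.sort(reverse=True)
    # Step 1: map each signed dimension to its spatial phrase, then combine.
    phrases = [DIRECTIONS[dim][0 if v > 0 else 1] for _, dim, v in active]
    return " and ".join(phrases)


print(label_motion({"x": 0.20, "y": 0.01, "z": -0.02, "gripper": 0.10}))
# -> move arm forward and close gripper
```

A coarser threshold produces fewer, more generic labels, which mirrors the trade-off noted above between helping the action query and keeping the language motion query predictable.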

RT-H: Inference & Correction

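At inference time RT-H must run the two queries sequentially, which causes latency problems for a 55B model. Two solutions:

Asynchronous querying (used in the experiments): the language motion query is trained to predict the skill for the next step, and the action query uses the skill inferred at the previous step, so the two queries can run in one batch.
Fixed frequency: run the language motion query once every $H$ steps.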
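The asynchronous scheme can be sketched as a control loop in which neither query waits on the other within a step. The `policy_h` and `policy_l` callables, the `initial_skill` placeholder for the first step, and the integer observations are hypothetical; the notes do not describe how the first step is bootstrapped.

```python
def run_async(policy_h, policy_l, observations, task, initial_skill):
    """Action query at step t uses the skill predicted at step t-1."""
    skill = initial_skill
    trace = []
    for obs in observations:
        # Neither call depends on the other's output at this step,
        # so a real system could send both to the model as one batch.
        next_skill = policy_h(obs, task)      # predicts the skill for the next step
        action = policy_l(obs, task, skill)   # consumes the previously inferred skill
        trace.append((skill, action))
        skill = next_skill
    return trace


# Stub policies so the loop is runnable; observations are just step indices.
stub_h = lambda obs, task: f"skill@{obs + 1}"
stub_l = lambda obs, task, skill: f"act({skill})"

for used_skill, action in run_async(stub_h, stub_l, [0, 1, 2], "pick coke can", "skill@0"):
    print(used_skill, "->", action)
```

The fixed-frequency alternative would instead refresh `skill` only when the step index is a multiple of $H$.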

Correction via Language Motions

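The language motion correction flow:

A human watches execution and decides when to intervene (e.g. by pressing a key).
The human enters a new language motion by keyboard or voice (e.g. “move arm left” in place of “move arm forward”).
The new language motion is passed directly to the action query, which produces a contextual action.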
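The flow above amounts to overriding the predicted language motion before the action query runs. The sketch below shows that override and logs the corrected step; the function name, the record layout, and the stub policies are hypothetical.

```python
def step_with_correction(policy_h, policy_l, obs, task, human_motion=None):
    """Run one control step, letting a human-specified motion replace the prediction."""
    predicted = policy_h(obs, task)
    motion = human_motion if human_motion is not None else predicted
    # The action query is unchanged: it simply conditions on whichever motion is given.
    action = policy_l(obs, task, motion)
    record = None
    if human_motion is not None:
        # An (observation, task, corrected motion) sample of the kind that could
        # later supervise the language motion query.
        record = {"obs": obs, "task": task, "motion": human_motion}
    return action, record


# Stub policies for illustration only.
stub_h = lambda obs, task: "move arm forward"
stub_l = lambda obs, task, motion: f"action for '{motion}'"

print(step_with_correction(stub_h, stub_l, "img_0", "close jar"))
print(step_with_correction(stub_h, stub_l, "img_0", "close jar", human_motion="move arm left"))
```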

Learning from corrections: only the language motion query $π_{h}$ needs updating, because the action query already knows how to map language motions to actions. The model is co-trained on the original dataset (both queries) and the correction dataset (language motion query only), with correction samples upweighted at roughly 50:1.
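To get a feel for what a 50:1 upweight does to a training batch, the sketch below computes the expected share of correction samples, assuming the ratio is a per-sample sampling weight (one plausible reading; the notes do not spell out the mechanism). The dataset sizes in the example are made-up round numbers.

```python
def batch_fractions(n_original, n_correction, upweight=50.0):
    """Expected batch share of each dataset when every correction sample
    is drawn `upweight` times as often as an original sample."""
    total = n_original + upweight * n_correction
    return {
        # Original data supervises both the language motion and action queries.
        "original": n_original / total,
        # Correction data supervises the language motion query only.
        "correction": upweight * n_correction / total,
    }


# Illustrative sizes only: 1000 original samples, 10 correction samples.
fractions = batch_fractions(1000, 10)
print(fractions)
```

With these illustrative sizes, 1% of the raw data ends up as one third of each batch, which is why a small correction set can still move the language motion query.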

Experiments

RT-H on Diverse Multi-Task Datasets

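Dataset: Diverse+Kitchen (D+K), 100K demonstrations: Kitchen (70K, 6 semantic task categories from RT-1/RT-2) + Diverse (30K, 24 semantic task categories). Evaluation covers the 8 hardest tasks.

Figure 3. Results on the Diverse+Kitchen dataset

Main result: RT-H averages 15% higher than RT-2 (better on 6/8 tasks). RT-2 is non-zero on only 4/8 tasks, while RT-H is non-zero on 6/8 tasks.

Ablation analysis:

RT-H-Joint (single-query variant): performs on par with RT-H, indicating robustness to the query mechanism.
RT-H-Cluster (K-means clusters in place of language motions): slightly worse than RT-H overall but better on the precise tasks (open/close jar). Clusters give finer-grained action context, but without language structure they are harder to predict on the broad dataset.
RT-H-OneHot (one-hot labels in place of language): drops significantly, showing that the structure of language matters.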

Table I. Offline validation MSE

Train Dataset	Eval Dataset	RT-2	RT-H-Joint	RT-H	RT-H (GT)
Kitchen	Kitchen	30.2	28.22	24.9	17.9
D+K	Diverse	27.7	25.44	23.6	17.8

RT-H's MSE is roughly 20% lower than RT-2's. With ground truth language motions (RT-H GT), the gap to the end-to-end models reaches 40%, showing that correct language motions are highly informative for action prediction.
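The percentages above can be checked against Table I with a short calculation. The numbers are copied from the table; the `relative_reduction` helper and the choice of which pairs to compare are mine, since the notes do not say which baseline the 40% gap is measured against.

```python
# Offline validation MSE copied from Table I (train dataset -> eval dataset).
TABLE_I = {
    "Kitchen -> Kitchen": {"RT-2": 30.2, "RT-H-Joint": 28.22, "RT-H": 24.9, "RT-H (GT)": 17.9},
    "D+K -> Diverse": {"RT-2": 27.7, "RT-H-Joint": 25.44, "RT-H": 23.6, "RT-H (GT)": 17.8},
}


def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


for setting, row in TABLE_I.items():
    print(setting)
    print(f"  RT-H vs RT-2:      {relative_reduction(row['RT-2'], row['RT-H']):.1%}")
    print(f"  RT-H (GT) vs RT-2: {relative_reduction(row['RT-2'], row['RT-H (GT)']):.1%}")
    print(f"  RT-H (GT) vs RT-H: {relative_reduction(row['RT-H'], row['RT-H (GT)']):.1%}")
```

On these two rows the RT-H reduction over RT-2 comes out at about 15-18%, and the ground-truth variant is about 36-41% below RT-2 and about 25-28% below end-to-end RT-H.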

Contextual & Flexible Language Motions

Contextuality: the same language motion produces different actions under different tasks and scenes. For example, “move arm forward” shows subtle differences in direction, speed, and rotation across tasks to fit the context.

Figure 4. Contextuality of language motions

Flexibility: RT-H can respond to out-of-distribution language motions. For example, the “pull napkin” task can be completed with either of two different language motions, “right and down” or “up and backward”.

Figure 5. Flexibility of language motions

Training on Online Corrections

Compares language motion corrections against teleoperated corrections, with 30 episodes of correction data per task.

Figure 6. Correction experiment results

RT-H + Human Intervention (upper bound): near-perfect success rate, indicating that language motion prediction is the performance bottleneck.
RT-H-Intervene (language motion corrections): improves from 40% to 63%, and beats RT-2-IWR by 60-70% on open/close jar.
RT-2-IWR (teleoperated corrections): drops from 25% to 13%. The action distribution shift introduced by teleop corrections is too large to learn from with so little data.
RT-H-InterveneAction (trains both the action and language motion queries): improves on RT-H by 9%, but risks policy degeneration.

Key insight: learning corrections in language motion space is more sample efficient than in action space, because language motion space is more compressed and more consistent with the training distribution.

Generalization

Trained on the Kitchen dataset and tested on three kinds of generalization:

New scenes: RT-H and RT-H-Joint are more robust than RT-2 in new environments (different lighting, background, flooring).

Figure 7. Generalization to new scenes

New objects: RT-H 65% vs RT-2 55% (pick + move tasks with unseen objects).

Table II. Novel object generalization

	pick	move	Average
RT-2	60	50	55
RT-H	70	60	65

New tasks with corrections: on completely unseen tasks, RT-H's language motions generalize partly zero-shot (e.g. shared picking behavior), and the remainder is completed with a small number of language motion corrections.

Figure 8. Generalization to new tasks (with a few corrections)

Related Work

Builds on

RT-2: RT-H builds directly on RT-2's PaLI-X 55B VLM architecture, replacing the single action prediction query with a two-stage language motion + action query.
PaLI-X: 55B multimodal Encoder-Decoder Transformer backbone with a 22B ViT encoder.

Compared against

RT-2: flat model baseline that predicts actions directly from task + observation.
IWR (Intervention Weighted Regression): teleoperation-based interactive IL baseline, the comparison method for RT-H-Intervene.

Related methods

RT-1: source of the Kitchen dataset; RT-H is evaluated in the same multi-task setting.
SayCan: LLM planning method for long-horizon instruction decomposition, operating at a different level from RT-H's action hierarchy.
DAgger: classic interactive IL method; RT-H's language motion correction is its analogue in language space.

Paper Review

Strengths

Simple, scalable core idea: language motions as an intermediate layer are an elegant abstraction that uses the compositional structure of language to bridge task and action; conceptually clean.
Automatic labeling removes annotation cost: extracting language motions from proprioception avoids the inconsistency and cost of human annotation; practical.
Language correction paradigm: correcting in language is more intuitive and more sample efficient than teleoperation, and only the language motion query needs updating, which greatly reduces the complexity of learning from corrections.
Thorough ablations: the OneHot / Cluster / Joint ablations clearly separate the contributions of language structure and of the action hierarchy.

Weaknesses

Absolute performance is still low: the average success rate over the 8 evaluation tasks is about 40% (RT-H) and only 63% even after corrections, leaving a large gap to real deployment.
Limited expressiveness of language motions: current language motions describe only spatial movement direction (e.g. “forward”, “left”) and lack object-referential (“reach the handle”) or speed information (“move slowly”). The authors acknowledge this as a trade-off.
Practical constraints of a 55B model: built on PaLI-X 55B, inference latency requires the async querying workaround, which limits real deployment.
Single robot embodiment: all experiments run on the same robot platform and cross-embodiment transfer is not validated, even though language motions should in principle have cross-embodiment potential.

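Credibility Assessment

Artifact availability

Code: not open-sourced
Model weights: not released (PaLI-X 55B itself is not public either)
Training details: hyperparameters and the training mixture are described in detail in the paper (learning rate, batch size, sampling ratio, etc.), but the details of the PaLI-X pre-training mixture are not disclosed
Dataset: the Kitchen dataset comes from RT-1/RT-2 (partly public via OXE); the Diverse dataset is not public

Claim verifiability

✅ RT-H is 15% higher than RT-2: backed by per-task results on 8 tasks, 95% Wilson Score CIs, and detailed staged success rates
✅ Language motion corrections are more sample efficient than IWR: head-to-head comparison with the same amount of data (30 episodes/task)
⚠️ Contextuality claims: the qualitative analysis in Fig. 4 is convincing, but the quantitative analysis in Table III only reports the mean/std of action dimensions and is not a rigorous measure of contextuality
⚠️ Generalization results: the novel object/scene experiments are limited in scale (50 scenarios) and test only on the Kitchen subset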
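For readers who want to sanity-check the reported 95% Wilson Score CIs, the standard Wilson interval for a binomial success rate can be computed as below. The formula is the textbook one; the example of 6 successes in 10 trials is hypothetical, since per-task trial counts are not recorded in these notes.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical example: 6 successes out of 10 evaluation trials.
low, high = wilson_interval(6, 10)
print(f"success rate 60%, 95% CI = [{low:.3f}, {high:.3f}]")
```

With only about ten trials per task the interval spans roughly half the unit range, which is worth keeping in mind when reading per-task differences between methods.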

Notes

Rating

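Metrics (as of 2026-04-24): citation=193, influential=11 (5.7%), velocity=7.51/mo; HF upvotes=9; github=N/A (no code repository)

Score: 2 - Frontier. Rationale: RT-H's “language motion as an intermediate action layer” is a representative paradigm work in the hierarchical VLA direction, cited as a key reference by VLA surveys and by later hierarchical robot transformers (e.g. HiRT). However, the method is limited by the closed-source PaLI-X 55B backbone, the unreleased code and weights, and absolute success rates that remain low (see Weaknesses 1 and 3), and community adoption falls short of de facto baselines such as RT-2 or OpenVLA, so it does not reach 3 - Foundation. It is also clearly above 1 - Archived: the idea is still cited as the prototype of language-conditioned action abstraction and has not been superseded.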
