Summary
NaVILA fine-tunes a VLM (VILA) into a navigation VLA that generates mid-level language action instructions (e.g., "move forward 75cm") instead of directly outputting low-level joint actions; a separate RL locomotion policy executes these instructions, enabling vision-language navigation on legged robots.
Problem & Motivation
Existing VLN methods operate mainly on discrete nav-graphs or in simplified continuous environments and lack the ability to be deployed on real legged robots. Having a VLA predict low-level joint actions directly runs into the sim-to-real gap and a high-dimensional action space. NaVILA proposes using language as a mid-level action representation, decoupling high-level semantic understanding from low-level motion control.
Method
- VLM backbone: VILA, fine-tuned from its stage-2 pretrained checkpoint
- Two-level hierarchy:
  - High-level VLA: takes egocentric vision + the language instruction, outputs a mid-level language action (e.g., "turn left 30 degrees", "move forward 75cm")
  - Low-level RL policy: translates the language action into joint-level control
- Training data: a mixture of four sources
  - Real-world YouTube egocentric touring videos (2000 videos → 20000 trajectories)
  - Simulated navigation (R2R-CE, RxR-CE in Habitat)
  - Auxiliary tasks (EnvDrop-augmented instructions, ScanQA 3D QA)
  - General VQA datasets (to preserve general capabilities)
- Action space: mid-level language actions (discrete semantics + continuous spatial arguments)
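To make the interface between the two levels concrete, here is a minimal sketch of how mid-level language actions like those above could be parsed into velocity commands for a low-level locomotion policy to track. This is an illustration, not the paper's implementation: the function name, the cruising speed, and the yaw rate are all assumptions.

```python
import math
import re

# Hypothetical sketch: map NaVILA-style mid-level language actions
# (e.g. "move forward 75cm", "turn left 30 degrees") to a velocity
# command {vx, wz, duration} that an RL locomotion policy could track.
# The 0.5 m/s and 0.5 rad/s defaults are illustrative assumptions.

def parse_action(text: str) -> dict:
    """Return linear velocity vx (m/s), yaw rate wz (rad/s), and how
    long to hold the command (s) for one mid-level language action."""
    m = re.fullmatch(r"move forward (\d+(?:\.\d+)?)\s*cm", text)
    if m:
        dist_m = float(m.group(1)) / 100.0
        vx = 0.5  # assumed cruising speed
        return {"vx": vx, "wz": 0.0, "duration": dist_m / vx}
    m = re.fullmatch(r"turn (left|right) (\d+(?:\.\d+)?)\s*degrees", text)
    if m:
        sign = 1.0 if m.group(1) == "left" else -1.0
        wz = 0.5  # assumed turning rate
        ang_rad = math.radians(float(m.group(2)))
        return {"vx": 0.0, "wz": sign * wz, "duration": ang_rad / wz}
    if text == "stop":
        return {"vx": 0.0, "wz": 0.0, "duration": 0.0}
    raise ValueError(f"unrecognized mid-level action: {text}")
```

The key design point this illustrates: the VLA only ever emits short strings in a constrained vocabulary, so the interface stays robot-agnostic and the low-level policy can be swapped without retraining the VLM.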
Key Results
- R2R-CE: 54.0% SR (a 17% improvement over the prior SOTA)
- RxR-CE: 44.0% SR
- ScanQA: 102.7 CIDEr (surpassing 3D models that require depth/pose)
- Real-world: 88% success rate (Unitree Go2 quadruped, 25 instructions)
- Introduces the VLN-CE-Isaac benchmark (high-fidelity environments in Isaac Sim)
Strengths & Weaknesses
Strengths:
- Using language as the mid-level action is an elegant design that naturally decouples perceptual understanding from motion control
- Large-scale use of YouTube video data addresses the scarcity of robot navigation data
- Achieves genuine sim-to-real transfer on legged robots
- The architecture is robot-agnostic; the low-level policy is swappable
Weaknesses:
- The granularity of mid-level language actions may require task-specific tuning
- Depends on the quality of the pretrained RL locomotion policy
- Open question for VLN-VLA unification: can language-based mid-level actions extend to manipulation?
Mind Map
```mermaid
mindmap
  root((NaVILA))
    Problem
      VLN on legged robots
      Sim-to-real gap
      High-dim action space
    Method
      VILA VLM backbone
      Two-level hierarchy
        Mid-level language actions
        RL locomotion policy
      Multi-source training data
        YouTube videos
        Simulated navigation
        Auxiliary tasks
    Results
      R2R-CE 54% SR
      Real-world 88% SR
      VLN-CE-Isaac benchmark
```
Notes
- NaVILA is among the most direct evidence for VLN-VLA unification: it is essentially a navigation-focused VLA
- Key difference from π0.5: π0.5 uses language subgoals to drive a low-level flow-matching action expert, while NaVILA uses language actions to drive an RL locomotion policy
- Both use hierarchical architectures, but the action representations differ (continuous flow vs. a discrete-command RL policy)
- The way YouTube video data is leveraged deserves attention: could the same approach be used for manipulation?