RAGEN-2: Reasoning Collapse in Agentic RL

Summary

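Core claim: multi-turn agent RL has a failure mode that entropy hides completely, template collapse: reasoning looks diverse within a single input but is nearly identical across inputs and no longer responds to the input.

Method: uses mutual information $I(X; Z)$ as a new diagnostic (retrieval-acc / MI-ZScore-EMA computed by in-batch cross-scoring); an SNR view explains collapse as low reward variance letting the KL/entropy regularizers overpower the task gradient; proposes SNR-Aware Filtering (top-p prompt selection by per-prompt reward variance).

Results: MI has a Spearman correlation of $+0.39$ with final performance, whereas entropy sits at $-0.11 \sim -0.14$. SNR-Aware Filtering raises task success almost consistently across PPO/DAPO/GRPO/Dr.GRPO, 0.5B–7B, and Qwen/Llama/Qwen-VL (mostly +5 to +23 pt on Sokoban, see Table 4), and because the update batch shrinks, step time drops by 26-41%.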

Sources: paper | website | github

Rating: 3 - Foundation (the framework that splits entropy into $I(X; Z) + H(Z \mid X)$, the SNR mechanism, and the single-knob filter: any one of the three is enough to reshape my mental model of diagnosis and intervention in agentic RL)

Key Takeaways:

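Entropy is misleading for reasoning quality: the standard entropy metric only measures $H(Z \mid X)$ (within-input diversity), which is orthogonal to input dependence $I(X; Z)$. Template collapse shows up as high entropy plus low MI, and every existing process-stability metric misses it. Splitting $H(Z) = I(X; Z) + H(Z \mid X)$ is the paper's central conceptual contribution.
The MI proxy needs no external model: it reuses the $(X_{i}, Z_{i, k})$ pairs from the rollout and runs one teacher-forced log-prob pass for every pair $(Z_{i, k}, X_{j})$ to build a scoring matrix. Under collapse, retrieval-acc degrades to the $1/P$ chance level.
The SNR mechanism attributes collapse to the gradient level: low reward variance drives the task gradient bound $\| g_{task} \| \leq \mathrm{Var}(R \mid X) \cdot C$ toward zero, while the KL + entropy gradient $g_{reg}$ does not depend on reward variance and stays constant. The whole batch update is then dominated by an input-agnostic direction, which pushes $I(X; Z)$ to 0 while leaving $H(Z \mid X)$ untouched.
SNR-Aware Filtering is a single-parameter intervention: sort prompts by per-prompt reward variance and let only the top-$\rho$ prompts enter the gradient update. It needs no extra rollouts and no extra model. It improves results almost consistently across 8 different tasks, 4 RL algorithms, 4 model sizes, and two modalities (text and vision).
When it works is predictable: a high Std(RV)/Mean(RV) ratio means a bimodal split between signal prompts and noise prompts, so filtering pays off. When the ratio is low (e.g. FrozenLake with GRPO, ratio = 0.33), filtering degenerates into dropping data at random and costs 5%.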

Teaser. Four-quadrant map of reasoning regimes: template collapse is the hidden failure mode in the top-right corner (high $H(Z \mid X)$ but $I(X; Z) \approx 0$), invisible to entropy.

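Background: why entropy needs a second look

In multi-turn agent RL training, researchers have long used reward as the proxy for outcome stability and entropy as the proxy for process stability. But entropy is ambiguous:

A drop may mean the model is becoming confident (healthy convergence).
Staying high may mean genuine diversity, or a fixed template repeated across different inputs (pseudo-diversity).

The authors apply the classic decomposition of marginal entropy:

$$H(Z) = I(X; Z) + H(Z \mid X)$$

$H(Z \mid X)$: within-input diversity, which is exactly what the entropy metric measures.
$I(X; Z)$: input dependence, which reflects whether reasoning actually changes with the input.

Template collapse happens when $I(X; Z)$ collapses while $H(Z \mid X)$ stays put, so no existing stability metric sees it.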
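To make the decomposition concrete, here is a minimal sketch on a toy discrete policy. The two conditional distributions `healthy` and `collapsed` are made-up illustrations (not data from the paper), and inputs are assumed to be uniformly distributed.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decompose(cond):
    """cond[x][z] = p(z | x); inputs x are assumed uniformly distributed.

    Returns (H(Z), I(X;Z), H(Z|X)) in nats.
    """
    n_x, n_z = len(cond), len(cond[0])
    # Marginal p(z) is the average of p(z | x) under the uniform prior on x.
    marginal = [sum(row[z] for row in cond) / n_x for z in range(n_z)]
    h_z = entropy(marginal)
    h_z_given_x = sum(entropy(row) for row in cond) / n_x
    return h_z, h_z - h_z_given_x, h_z_given_x

# Toy policies (illustrative only): rows are inputs, columns are reasoning chains.
healthy = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5]]      # each input uses its own chains
collapsed = [[0.25, 0.25, 0.25, 0.25],
             [0.25, 0.25, 0.25, 0.25]]  # same spread-out template for every input

for name, cond in [("healthy", healthy), ("collapsed", collapsed)]:
    h_z, mi, h_cond = decompose(cond)
    print(f"{name}: H(Z)={h_z:.3f}  I(X;Z)={mi:.3f}  H(Z|X)={h_cond:.3f}")
```

In this toy case the collapsed policy has the higher conditional entropy and zero mutual information, which is the signature an entropy-only monitor would read as healthy diversity.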

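❓ The decomposition is conceptually simple; it is just $I(X; Z) = H(Z) - H(Z \mid X)$. The RL community may not have used it this way before because everyone assumed reasoning is naturally input-dependent. A question worth asking: is template collapse equally common in single-turn RLHF? The authors do not run an explicit single-turn comparison.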

Diagnosis: MI Proxy Family

In-Batch Cross-Scoring

Given $P$ prompts with $G$ reasoning samples each, build a $P \times G \times P$ scoring matrix:

$$L_{i, k, j} = \log p_{\theta}(Z_{i, k} \mid X_{j})$$

Extract two length-normalized quantities:

$$\mathrm{matched}_{i, k} = \frac{L_{i, k, i}}{|Z_{i, k}|}, \qquad \mathrm{marginal}_{i, k} = \frac{1}{|Z_{i, k}|} \log \frac{1}{P} \sum_{j} \exp(L_{i, k, j})$$

Intuition: matched is the per-token log-likelihood of reasoning $Z_{i, k}$ under its true source prompt $X_{i}$; marginal is its log-likelihood under a uniform mixture of the prompts in the batch. Their difference is how specific that reasoning is to its source prompt.
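The two quantities can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repository's implementation: `L` is taken to be a nested list with `L[i][k][j]` the summed log-prob of sample `k` of prompt `i` scored under prompt `j`, `lengths[i][k]` is its token count, and the toy numbers are invented.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def matched_and_marginal(L, lengths):
    """Return per-sample (matched, marginal) per-token log-likelihoods."""
    P = len(L)
    matched, marginal = [], []
    for i in range(P):
        m_row, g_row = [], []
        for k, scores in enumerate(L[i]):
            n = lengths[i][k]
            m_row.append(scores[i] / n)  # score under the true source prompt
            # log of the uniform mixture over the P prompts in the batch
            g_row.append((logsumexp(scores) - math.log(P)) / n)
        matched.append(m_row)
        marginal.append(g_row)
    return matched, marginal

# Toy batch with P=2 prompts and G=1 sample each (made-up numbers).
specific = [[[-10.0, -30.0]], [[-28.0, -12.0]]]   # each sample prefers its own prompt
collapsed = [[[-10.0, -10.0]], [[-12.0, -12.0]]]  # scores ignore the prompt
lengths = [[10], [10]]

for name, L in [("specific", specific), ("collapsed", collapsed)]:
    matched, marginal = matched_and_marginal(L, lengths)
    gaps = [m - g for mr, gr in zip(matched, marginal) for m, g in zip(mr, gr)]
    print(name, [round(g, 4) for g in gaps])
```

When scores do not depend on the prompt, matched and marginal coincide and the gap vanishes; when each sample is far more likely under its own prompt, the gap is positive.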

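The two headline proxies

Retrieval-Acc (discrete, interpretable):

$$\mathrm{Acc} = \frac{1}{PG} \sum_{i, k} \mathbb{1}\left[ i = \arg\max_{j} L_{i, k, j} \right]$$

Under collapse it approaches the chance level $1/P$ (1.56% for $P = 64$), which gives it an absolute reference point.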
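A minimal sketch of Retrieval-Acc, assuming the same hypothetical nested-list layout `L[i][k][j]` for the scoring matrix. Length normalization is omitted because dividing all $P$ scores of one sample by the same positive length does not change the argmax. The tie-breaking rule (first index wins) is an assumption of this sketch.

```python
def retrieval_acc(L):
    """Fraction of samples whose highest-scoring prompt is their source prompt."""
    P = len(L)
    hits = total = 0
    for i in range(P):
        for scores in L[i]:
            # max() returns the first maximal index, so exact ties go to j = 0.
            best_j = max(range(P), key=lambda j: scores[j])
            hits += int(best_j == i)
            total += 1
    return hits / total

P, G = 4, 2
# Fully collapsed toy batch: every sample scores identically under every prompt.
collapsed = [[[-5.0] * P for _ in range(G)] for _ in range(P)]
# Input-dependent toy batch: the source prompt scores clearly highest.
specific = [[[-1.0 if j == i else -9.0 for j in range(P)] for _ in range(G)]
            for i in range(P)]

print(retrieval_acc(collapsed), retrieval_acc(specific))
```

With exact ties and first-index tie-breaking, only the samples of prompt 0 count as hits, so the collapsed batch lands exactly on the $1/P$ chance level.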

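MI-ZScore-EMA (continuous, robust):

$$\hat{I}(X; Z) = \frac{1}{PG} \sum_{i, k} \left( \mathrm{matched}_{i, k} - \mathrm{marginal}_{i, k} \right)$$

This is then z-scored using an EMA-smoothed batch std. Under collapse the two terms in the numerator converge, so the proxy approaches 0.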
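The note does not spell out the exact normalization, so the following is only one plausible reading: average the per-sample gaps (matched minus marginal), then divide by an exponential moving average of the batch standard deviation of those gaps. The class name `MIZScoreEMA`, the decay `beta`, and the `eps` guard are all assumptions of this sketch.

```python
import math

class MIZScoreEMA:
    """Hypothetical tracker: raw MI estimate divided by an EMA of the batch std."""

    def __init__(self, beta=0.9, eps=1e-8):
        self.beta = beta      # EMA decay (assumed value)
        self.eps = eps        # avoids division by zero
        self.ema_std = None

    def update(self, gaps):
        """gaps: flat list of matched - marginal for every sample in the batch."""
        n = len(gaps)
        raw = sum(gaps) / n
        std = math.sqrt(sum((g - raw) ** 2 for g in gaps) / n)
        if self.ema_std is None:
            self.ema_std = std
        else:
            self.ema_std = self.beta * self.ema_std + (1 - self.beta) * std
        return raw, raw / (self.ema_std + self.eps)

tracker = MIZScoreEMA()
raw1, z1 = tracker.update([0.1, 0.3])   # input-dependent batch
raw2, z2 = tracker.update([0.0, 0.0])   # collapsed batch: gaps vanish
print(raw1, z1, raw2, z2)
```

Smoothing the denominator keeps a single low-variance batch from inflating the score, which is presumably why the EMA variant is described as the most stable member of the family.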

Proxy Family overview

Table 1. The MI proxy family. Members vary along three axes: turn scope (first-turn vs trajectory), aggregation (discrete vs continuous), and length normalization (per-token vs per-sequence).

| Type | Proxy | Key property |
| --- | --- | --- |
| Discrete | Retrieval-Acc | Approaches $1/P$ under collapse |
| Discrete | Recall@k | $k \in \{2, 4, 8\}$ |
| Continuous (raw) | MI-Est | Per-token; → 0 under collapse |
| Continuous (raw) | MI-Seq-Est | Per-sequence; no length normalization |
| Continuous (z-score) | MI-ZScore | Normalized by batch std |
| Continuous (z-score) | MI-ZScore-EMA | Normalized by EMA std; most stable |

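Validation: Spearman correlation of MI vs entropy with performance

Trajectory MI-ZScore has a Spearman correlation of $+0.39$ with final task performance; Reasoning Entropy / Conditional Entropy sit at $-0.11 \sim -0.14$ (the opposite sign).

Figure 8. Metrics in the MI family correlate positively with performance, while the entropy family is near zero or negative. As a process diagnostic for multi-turn agent RL, entropy is misleading.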

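Mechanism: Signal-to-Noise Ratio View

Empirical Observation

Prompts are split into 6 equal-size buckets by within-prompt reward variance (RV), and the task gradient norm and regularization gradient norm are measured per bucket:

Figure 3. Three patterns: (a) $\| g_{task} \|$ rises monotonically with the RV bucket; (b) as RV approaches 0 the task gradient is still non-zero but carries almost no useful signal; (c) $\| g_{reg} \|$ (KL + entropy) is completely flat across buckets.

In the lowest-RV bucket the update is dominated almost entirely by input-agnostic regularization.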
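The bucketing step itself is easy to sketch. The helper names below are hypothetical, and measuring the gradient norms per bucket would additionally need the policy model, which is out of scope here; the toy rewards are invented so that variance grows with the prompt index.

```python
def reward_variance(rewards):
    """Unbiased sample variance of one prompt's G rewards."""
    g = len(rewards)
    mean = sum(rewards) / g
    return sum((r - mean) ** 2 for r in rewards) / (g - 1)

def equal_size_buckets(rewards_per_prompt, n_buckets=6):
    """Return n_buckets lists of prompt indices, ordered from lowest to highest RV."""
    rv = [reward_variance(r) for r in rewards_per_prompt]
    order = sorted(range(len(rv)), key=lambda i: rv[i])
    n = len(order)
    # Integer split points keep bucket sizes within one of each other.
    return [order[b * n // n_buckets:(b + 1) * n // n_buckets]
            for b in range(n_buckets)]

# Toy rollout: prompt i has rewards [0, i], so its variance is i**2 / 2.
toy_rewards = [[0.0, float(i)] for i in range(12)]
buckets = equal_size_buckets(toy_rewards)
print(buckets)
```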

Gradient Decomposition

For the $G$ trajectories of input $x$, with advantage $A_{g} = R_{g} - \bar{R}(x)$, the task gradient is

$$g_{task}(x) = \frac{1}{G} \sum_{g} A_{g} \nabla_{\theta} \log \pi_{\theta}(\tau_{g} \mid x)$$

Cauchy-Schwarz gives:

$$\| g_{task}(x) \| \leq \mathrm{Var}(R \mid X = x) \cdot C$$

Low reward variance directly caps the upper bound of the task gradient while $g_{reg}$ is unchanged, so the SNR drops.
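A quick numerical sanity check of the idea, under stated assumptions. The sketch applies Cauchy-Schwarz directly to the empirical estimator, which bounds the norm by the reward standard deviation times the RMS norm of the score vectors; the constant and exact form stated in the paper may differ, but either way the bound vanishes as reward variance goes to zero. The random "score vectors" stand in for $\nabla_{\theta} \log \pi_{\theta}(\tau_{g} \mid x)$ and are purely synthetic.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def task_gradient(rewards, score_vectors):
    """(1/G) * sum_g (R_g - mean R) * score_g, with score_g a plain list of floats."""
    G = len(rewards)
    mean_r = sum(rewards) / G
    grad = [0.0] * len(score_vectors[0])
    for r, s in zip(rewards, score_vectors):
        a = r - mean_r
        for d, s_d in enumerate(s):
            grad[d] += a * s_d / G
    return grad

def cauchy_schwarz_bound(rewards, score_vectors):
    """Population std of rewards times RMS norm of the score vectors."""
    G = len(rewards)
    mean_r = sum(rewards) / G
    std = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / G)
    rms = math.sqrt(sum(norm(s) ** 2 for s in score_vectors) / G)
    return std * rms

rng = random.Random(0)
scores = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(8)]
high_rv = [rng.choice([0.0, 1.0]) for _ in range(8)]
zero_rv = [1.0] * 8  # every trajectory gets the same reward

print(norm(task_gradient(high_rv, scores)), cauchy_schwarz_bound(high_rv, scores))
print(norm(task_gradient(zero_rv, scores)), cauchy_schwarz_bound(zero_rv, scores))
```

With identical rewards every advantage is zero, so the task gradient is exactly zero no matter how large the score vectors are; any regularization gradient would then be the only term left in the update.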

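Table 2. Three-way decomposition of the update into signal and noise components.

| Component | Source | Level | Controllable | Mitigation |
| --- | --- | --- | --- | --- |
| $g_{signal}$ | True reward differences between trajectories of the same prompt | Prompt | No | SNR-Aware Filtering |
| $g_{task-noise}$ | Sampling / environment stochasticity | Prompt | No | Filter high-noise prompts |
| $g_{reg}$ | Uniform shrinkage from KL/entropy, independent of input | Chain | Yes | Tune $\lambda_{KL}$, $\lambda_{ent}$ |

Key observation: $g_{reg}$ acts at the chain level rather than the prompt level. Every reasoning chain receives the same uniform shrinkage, so it is input-agnostic by construction, and it is the direct force pushing $I(X; Z)$ toward zero.

Figure 2. Schematic SNR view: high RV → strong task gradient + good convergence; low RV → regularization dominates + erratic updates + input-agnostic reasoning.

This SNR argument explains template collapse as "over-regularization at the gradient level". It is far cleaner than the earlier ad hoc entropy/KL tuning perspectives, and it is my favorite part of the paper.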

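Method: SNR-Aware Filtering

Top-p by Reward Variance

Each iteration:

Rollout: sample $G$ trajectories per prompt.
Compute $\mathrm{Var}(R \mid X) = \frac{1}{G - 1} \sum_{g} (R_{g}(X) - \bar{R}(X))^{2}$.
Sort prompts by RV in descending order and accumulate variance mass until it reaches the threshold $\tau = \rho \sum_{i} \mathrm{Var}(R \mid X = x_{i})$.
Run the gradient update only on the selected prompt set $S$.

This is analogous to nucleus sampling, except the unit is a prompt rather than a token. Top-p is preferable to top-k: it can reject a whole batch when signal is weak across the board, whereas top-k always fills its quota.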
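The steps above can be sketched as follows. This is a minimal illustration, not the RAGEN repository's code: the function names are hypothetical, and two behaviors are assumptions of the sketch, namely returning an empty selection when every prompt has zero variance and never selecting zero-variance prompts (which also guards against floating-point shortfall when `rho` is close to 1).

```python
def reward_variance(rewards):
    """Unbiased sample variance of one prompt's G rewards."""
    g = len(rewards)
    mean = sum(rewards) / g
    return sum((r - mean) ** 2 for r in rewards) / (g - 1)

def snr_aware_filter(rewards_per_prompt, rho=0.9):
    """Return indices of prompts kept for the update (top-p by variance mass)."""
    rv = [reward_variance(r) for r in rewards_per_prompt]
    total = sum(rv)
    if total == 0.0:
        return []  # assumption: no prompt carries signal, skip the batch
    threshold = rho * total
    order = sorted(range(len(rv)), key=lambda i: rv[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        if rv[i] == 0.0:
            break  # assumption: zero-variance prompts add no mass, never keep them
        selected.append(i)
        mass += rv[i]
        if mass >= threshold:
            break
    return selected

# Toy batch: 4 prompts, G = 4 binary rewards each (made-up numbers).
batch = [
    [0.0, 1.0, 0.0, 1.0],  # RV = 1/3
    [1.0, 1.0, 1.0, 1.0],  # RV = 0 (always solved)
    [0.0, 0.0, 0.0, 1.0],  # RV = 1/4
    [0.0, 0.0, 0.0, 0.0],  # RV = 0 (never solved)
]
print(snr_aware_filter(batch, rho=0.9))
print(snr_aware_filter(batch, rho=0.5))
```

With `rho=0.9` both non-degenerate prompts are needed to reach 90% of the variance mass; with `rho=0.5` the single highest-variance prompt already suffices. The kept set size adapts to the batch, which is the property the nucleus analogy is pointing at.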

Figure 4. SNR-Aware Filtering workflow: rollout → compute RV → top-p prompt selection → update on the high-signal subset.

Experiments

Setup

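Table 3. The 7 environments:

| Task | Stochastic | Multi-turn | State | Reward |
| --- | --- | --- | --- | --- |
| Sokoban | ✗ | ✓ | Grid | Dense |
| FrozenLake | ✓ | ✓ | Grid | Binary |
| MetaMathQA | ✗ | ✓ | Text | Dense |
| Countdown | ✗ | ✗ | Text | Binary |
| SearchQA | ✗ | ✓ | Text | Dense |
| WebShop | ✗ | ✓ | Text | Dense |
| DeepCoder | ✗ | ✗ | Text | Dense |

The main experiments run on Qwen2.5-3B with the veRL/HybridFlow stack and compare PPO / DAPO / GRPO / Dr. GRPO, for at most 400 iterations with $K = P \times G = 128$ trajectories per iteration (default $P = 8, G = 16$).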

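Template collapse is widespread

Figure 5. Training dynamics: the MI proxy (retrieval acc) starts falling before task performance does, while conditional entropy stays high throughout. This is the hallmark of template collapse. MI is an early-warning signal; entropy misses the whole event.

Figure 7. Reasoning length decreases monotonically across 8 environments. The behavioral signature of template collapse is output that gets shorter and more templated.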

SNR-Aware Filtering works across axes

Table 4. Baseline peak (+filter delta), in %. Filtering raises the average score almost consistently across 4 tasks, 4 algorithms, 4 model sizes, 2 model families, and text + vision modalities.

| Variant | Sokoban | FrozenLake | MetaMath | Countdown | Avg |
| --- | --- | --- | --- | --- | --- |
| PPO Qwen2.5-3B | 12.9 (+16.0) | 67.0 (+10.9) | 92.6 (+0.6) | 97.9 (+0.0) | 67.6 (+6.9) |
| DAPO | 16.2 (+5.1) | 66.8 (+2.1) | 90.8 (+2.8) | 95.7 (+1.6) | 67.4 (+2.9) |
| GRPO | 12.1 (+9.0) | 70.9 (-3.0) | 91.2 (+1.2) | 95.7 (+2.2) | 67.5 (+3.7) |
| Dr. GRPO | 12.1 (-0.4) | 23.2 (+0.6) | 91.2 (+1.4) | 96.5 (+1.4) | 55.8 (+0.8) |
| Qwen2.5-0.5B | 3.3 (+22.9) | 19.5 (+0.0) | 10.0 (-0.2) | 23.0 (-0.7) | 14.0 (+5.5) |
| Qwen2.5-7B | 42.4 (+4.9) | 85.0 (-0.6) | 84.0 (+11.7) | 97.7 (+0.3) | 77.3 (+4.1) |
| Llama3.2-3B | 24.4 (+18.8) | 84.6 (-0.2) | 86.1 (+3.7) | 99.2 (-1.2) | 73.6 (+5.3) |
| Qwen2.5-VL-3B (V) | 65.0 (+12.0) | 19.5 (+59.5) | – | – | 42.3 (+35.8) |

On Sokoban the PPO baseline is only 12.9% and jumps to 28.9% with the filter, which is a very large gain. But the baseline itself is weak, and the filtered runs still land in a low range: DAPO goes 16.2 → 21.3 and GRPO 12.1 → 21.1. It is worth asking whether the filter is pulling weak baselines up to roughly where they should have been, rather than opening a new ceiling. On Qwen-VL, the FrozenLake jump from 19.5 to 79.0 (+59.5) is extremely aggressive, and I suspect the baseline was not properly tuned.

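Compute cost

Table 5. Under a fixed $K = 128$ rollout budget, filtering shrinks the effective minibatch, so step time actually drops by 26-41% while VRAM is almost unchanged. Computing RV itself takes <0.1% of the time. (NF = no filter, F = filter.)

| $P \times G$ | NF perf | F perf | $\Delta$ | NF time | F time | $\Delta$ % |
| --- | --- | --- | --- | --- | --- | --- |
| 128×1 | 23.6 | – | – | 89.8 | – | – |
| 64×2 | 18.8 | 27.3 | +8.6 | 91.8 | 64.9 | -29% |
| 32×4 | 24.2 | 27.4 | +3.2 | 89.8 | 52.6 | -41% |
| 8×16 | 15.6 | 23.6 | +8.0 | 89.2 | 65.9 | -26% |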

Causal Analysis

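Quartile Ablation: direct intervention on RV

Prompts are split into four quartiles by RV and each run trains on a single quartile. MI degrades monotonically from Q1 to Q4, and task performance falls from 21.1 to 11.0 (Q3 and Q4 are roughly tied at 10.7 and 11.0). This supports the causal chain RV → gradient quality → input-dependent reasoning.

Table 6.

| Quartile | RV Range | Task Perf (%) | MI Proxy | Entropy |
| --- | --- | --- | --- | --- |
| Q1 (highest RV) | [4.4–5.6] | 21.1 | 0.95 | 2.02 |
| Q2 | [1.5–4.2] | 19.5 | 0.93 | 1.53 |
| Q3 | [0.0–0.2] | 10.7 | 0.81 | 1.41 |
| Q4 (lowest RV) | [0.0–0.1] | 11.0 | 0.73 | 1.87 |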

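Noise Injection: degradation under high stochasticity

Figure 9. On FrozenLake, environment stochasticity is swept from 0% to 100%: task return falls, entropy rises, and $I(X; Z)$ decreases monotonically. The filter's advantage holds at 0–50% and the two curves converge at 80–100%. High noise robs RV of its discriminative power, which is exactly the boundary the mechanism predicts.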

Prompt-level vs Trajectory-level Filtering

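Table 7. A trajectory-level filter (keeping the top-8/bottom-8 trajectories) serves as a control to confirm that the gain does not come from a shift in the prompt distribution. The prompt-level filter is clearly better.

| Method | Prompts | Traj/Update | Task Perf | MI |
| --- | --- | --- | --- | --- |
| No filter | 8/8 | 128 | 12.9 | 0.83 |
| Prompt-level RV ($\rho = 0.9$) | 3.2/8 | 50.6 | 23.6 | 1.80 |
| Trajectory-level | 8/8 | 64 | 16.8 | 0.20 |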

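When it works: Std(RV)/Mean(RV)

Table 8. This ratio can be computed from a single rollout batch, so whether the filter will help can be judged before training. High ratio → bimodal RV → the filter cleanly separates signal from noise; low ratio → uniform RV → the filter degenerates into random dropping.

| Setting | Filter $\Delta$ | Std/Mean |
| --- | --- | --- |
| Sokoban, 14B | +4.6% | 1.29 |
| Sokoban, 3B | +3.2% | 1.16 |
| FrozenLake, 3B (GRPO) | -5.0% | 0.33 |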
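Computing the ratio from one batch of per-prompt reward variances takes only a few lines. The sketch below uses the population standard deviation, which is an assumption (the note does not say which estimator the paper uses), and the two toy RV lists are invented to contrast a bimodal batch with a uniform one.

```python
import math

def rv_dispersion(reward_variances):
    """Std(RV) / Mean(RV) over the prompts of one rollout batch."""
    n = len(reward_variances)
    mean = sum(reward_variances) / n
    if mean == 0.0:
        return 0.0  # assumption: a batch with no variance at all is treated as ratio 0
    std = math.sqrt(sum((v - mean) ** 2 for v in reward_variances) / n)
    return std / mean

bimodal = [0.0, 0.0, 0.0, 4.0]   # few signal prompts, many dead ones
uniform = [1.0, 1.0, 1.0, 1.0]   # every prompt equally informative

print(rv_dispersion(bimodal), rv_dispersion(uniform))
```

The bimodal batch gives a ratio above 1, in the range where Table 8 reports gains, while the uniform batch gives 0, where ranking by RV cannot distinguish prompts at all.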

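This is my second-favorite design choice in the paper: it does not package the filter as a universal solution and instead gives a cheap, predictable failure condition. That kind of candor is rare in RL papers.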

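Relation to existing stabilizers

Figure 13. Trajectories of three interventions: entropy tuning and KL tuning move mainly along the $H(Z \mid X)$ axis and barely change $I(X; Z)$; only SNR-Aware Filtering moves along the diagonal (MI ↑ + perf ↑). This suggests the three are orthogonal and can be stacked.

Format Validity ≠ Content Diagnostic

Figure 12. Almost every run keeps near-perfect format validity, yet MI can fall all the way to collapse. Format checks cannot replace a content-sensitive diagnostic.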

Related Work

Builds on

RAGEN v1 (wang2025ragenunderstandingselfevolutionllm): the StarPO framework + multi-turn agent RL testbed; this paper reuses the testbed and diagnoses its failure modes
veRL / HybridFlow (sheng2024hybridflow): training stack
Cover & Thomas, Elements of Information Theory: the standard decomposition $H(Z) = I(X; Z) + H(Z \mid X)$

Comparisons / algorithm baselines

PPO (schulman2017proximalpolicyoptimizationalgorithms): classic baseline
DAPO (yu2025dapo): dynamic sampling, treated in this paper as the $\rho \to 1$ special case of the filter
GRPO (shao2024deepseekmathpushinglimitsmathematical): group-relative policy optimization
Dr. GRPO (liu2025understandingr1zero): debiased GRPO

Method-related

Reasoning collapse / policy degeneracy literature (wei2025gtrguidedthoughtreinforcement, yao2025diversityawarepolicyoptimizationlarge, yun2025priceformatdiversitycollapse): reports similar phenomena but diagnoses them with entropy / lexical metrics, which this paper argues are insufficient
Model collapse in self-training (gerstgrasser2024modelcollapseinevitablebreaking, Shumailov2024AIMC): earlier literature on distribution collapse caused by closed-loop training
Reasoning faithfulness (lanham2023measuringfaithfulnesschainofthoughtreasoning, turpin2023languagemodelsdontsay): whether CoT really reflects the decision basis; this paper focuses on input dependence rather than faithfulness
EPO (xu2025epoentropyregularizedpolicyoptimization), Diversity-aware PO (yao2025diversityawarepolicyoptimizationlarge): entropy / diversity regularization interventions; this paper shows they move along the $H(Z \mid X)$ axis without changing $I(X; Z)$, so they are orthogonal to the SNR filter

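Review

Strengths

A clear conceptual contribution that generalizes: splitting entropy into $I(X; Z) + H(Z \mid X)$ is a first-principles view. Once stated, it looks "obvious" in hindsight, which is the mark of a good framework. In-batch cross-scoring for the MI proxy reuses the sequences already in the rollout and needs no extra rollouts or external model, so the barrier to adoption is very low.
The SNR mechanism connects phenomenon to method in one causal chain: empirical observation (gradient differences across RV buckets) → mathematical decomposition (Cauchy-Schwarz bound) → intervention (top-p RV filter) → causal validation (quartile ablation + noise injection). The quartile ablation is especially nice because it is an intervention, not just a correlation.
The failure condition is explicit: Std(RV)/Mean(RV) is a cheap diagnostic that can tell you before training whether the filter is worth enabling. Knowing when a method does not work is rare in RL papers.
Enough breadth across axes: 4 tasks × 4 algorithms × 4 model sizes × 2 model families × 2 modalities, which is enough coverage to rule out an artifact of one particular setup.
Compute goes down, not up: because the filter shrinks the effective batch, step time falls 26-41%. Wherever accuracy also improves, the method beats the baseline on both axes (accuracy ↑ + time ↓).

Weaknesses

Baseline strength: on Sokoban the PPO baseline is only 12.9% (28.9% with the filter) and the GRPO baseline only 12.1% (21.1% with the filter). Many of the large gains are concentrated on weak baselines and hard tasks. I suspect part of what the filter does is bring under-tuned baselines up to a reasonable level. The paper should report how much the filter adds on top of the strongest baseline (e.g. GRPO with thoroughly tuned KL/entropy).
The +59.5 delta on FrozenLake with Qwen2.5-VL is too extreme: baseline 19.5 → 79.0. A jump of this size usually signals that the baseline did not converge or that some hyperparameter was wrong; it needs a sanity check.
The MI proxy depends on in-batch samples: with $P = 8$ the retrieval chance level is $1/8 = 12.5\%$, which may not leave enough resolution. The authors do not systematically ablate how large $P$ must be for the proxy to be stable.
Multi-turn signal is flattened: the MI proxy only looks at each turn's reasoning $Z_{t}$, and the trajectory variant samples uniformly across turns. Reasoning in a multi-turn agent should have turn-position-dependent structure (early turns lean toward planning, later ones are more reactive), so uniform treatment may hide partial collapse.
DAPO already filters, and the authors treat it as the special case "top-p filter with $\rho \to 1$". But DAPO's dynamic sampling in effect rejects zero-advantage rollouts, so the mechanisms are not fully equivalent. A finer-grained ablation is needed to decouple the two filters.
No discussion of the interaction with reward shaping: filter deltas differ widely between dense-reward tasks (Sokoban / WebShop / DeepCoder) and binary-reward tasks (FrozenLake / Countdown), but the authors do not analyze this axis separately.
No extrapolation to multi-agent / long-horizon settings: the authors mention this in Limitations themselves. Whether single-agent conclusions transfer to multi-agent RL (sparser, noisier rewards) is an open question.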

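Credibility Assessment

Artifact availability

Code: inference + training (mll-lab-nu/RAGEN, including the SNR-Adaptive Filtering implementation)
Model weights: not stated (the README lists no released checkpoint)
Training details: fairly complete. RL algorithms (PPO/DAPO/GRPO/Dr.GRPO) + veRL/HybridFlow stack + Qwen2.5 / Llama3.2 / Qwen2.5-VL + default $P = 8, G = 16, K = 128$ + $\rho$ keep rate + 400 iterations; per-task reward design, env config, and the values of $\lambda_{KL}, \lambda_{ent}$ are in the Appendix
Datasets: all open. Sokoban / FrozenLake / MetaMathQA / Countdown / SearchQA / WebShop / DeepCoder are all public environments

Claim verifiability

✅ MI proxy has Spearman +0.39 with performance, entropy correlates negatively: Figure 8 computes this systematically over multiple intervention sweeps, and the method is reproducible
✅ Quartile ablation shows the RV → MI/perf relationship: Table 6 is a controlled intervention and therefore causal evidence
✅ Filtering raises the average across 4 tasks × 4 algorithms × 4 model sizes × 2 modalities: the grid covered by Table 4 is sufficient
✅ Step time drops 26-41%: Table 5 measures wall-clock time directly
⚠️ "Template collapse is a systematic failure mode": the claim that it appears in every RL setting is strong but validated only on the RAGEN testbed (a specific task subset and a specific stack); carrying it over to production-scale RLHF is extrapolation
⚠️ "DAPO is the $\rho \to 1$ special case of SNR-Aware Filtering": the conceptual analogy is plausible, but the two filtering rules are not fully equivalent (DAPO is a hard filter on whether the advantage is non-zero, not an RV ranking); a more direct head-to-head is needed
⚠️ +59.5 pt on FrozenLake for the VL model: an abnormally large delta that may indicate the baseline did not fully converge
❌ No obvious marketing language; the paper's tone is restrained overall, it lists its own Limitations, and it points out that the model could game the filter (by artificially inflating RV)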

Notes

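Biggest conceptual takeaway: splitting entropy into MI + conditional entropy looks "obvious", but nobody had done it as an RL training diagnostic before. This is a textbook case of good work that seems obvious once the framework exists. It is worth carrying over to other closed-loop training settings (self-improvement, RLAIF, Constitutional AI) and asking the same question: do their process metrics also miss cross-input collapse?
Portability: SNR-Aware Filtering is single-knob and drop-in, so in principle it can be added directly to any group-based RL method (the GRPO family, RLOO, ReMax, etc.). It would be worth running a controlled comparison on mainstream RLHF stacks (e.g. OpenRLHF / verl mainline).
Relation to RV-based curricula: many curriculum / hard-example mining works also select samples by reward or advantage variance. This paper's framework offers a possible path to unify them from the SNR angle; they may all be raising SNR implicitly without saying so.
❓ Reliability of the MI proxy on long horizons: uniform sampling across turns in the trajectory variant may dilute the signal. If MI were measured separately per turn position, would collapse show up first at certain positions? This is a cheap follow-up.
❓ Interaction with RLVR / rule-based rewards: rewards in these experiments are all task-level (success/failure, step bonus). With a verifier-based dense reward (e.g. a PRM), the shape of the RV distribution would be very different and the filter's behavior might change.
❓ Predicting the RV distribution: Std(RV)/Mean(RV) is a post-hoc diagnostic. Can it be predicted a priori at the task / dataset level? For example, a binary-reward task with a low success rate will necessarily have RV concentrated at the low end, so the filter will not help.
On the writing: the authors proactively list Limitations and flag that a "capable model may game the filter (artificially inflate RV)". That candor is a sign of evidence-driven work.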

Rating

Metrics (as of 2026-04-24): citation=0, influential=0 (0%), velocity=0.00/mo; HF upvotes=65; github 2632⭐ / forks=220 / 90d commits=25 / pushed 10d ago

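Score: 3 - Foundation. Rationale: the paper gives a framework that generalizes (the $H(Z) = I(X; Z) + H(Z \mid X)$ decomposition + SNR attribution) rather than a +0.3% SOTA result, and it will change how I read agentic-RL papers from now on (I will always ask whether template collapse is present). The MI proxy and the top-p filter are both low-overhead artifacts that drop into any group-based RL stack (Strengths 1/5), and the quartile ablation provides real causal evidence (Strengths 2). Together these lift it above "2 - Frontier (a particular SOTA / baseline)". The caveat that keeps it at the lower edge of 3: the community has not yet validated it as a must-cite (the paper has only just come out), but the first-principles nature and transferability of the framework already give it foundation potential.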
