Annotated Notes on the ReAct Paper

Synergizing Reasoning and Acting in Language Models

Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
Affiliations: Princeton University & Google Research (Brain Team)
Venue: ICLR 2023 (Oral) · First published Oct 2022
Links: arXiv · Project Page · Code

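§0 Abstract: sentence-by-sentence annotation

Below is the paper's abstract in its original wording. Each sentence is followed by a plain-language paraphrase and a study note. Academic expressions worth remembering are collected in the vocabulary tracker (§V).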

While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics.

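In plain terms: LLMs already do well at language understanding and at interactive decision making, but their reasoning abilities (such as chain-of-thought prompting) and their acting abilities (such as generating action plans) have mostly been studied separately.

This sentence names the gap the paper sets out to close: reasoning and acting each had their own line of work, but the two had rarely been studied together.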

In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information.

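In plain terms: the model is prompted to produce reasoning traces and task-specific actions in alternation. The reasoning helps it form, track, and revise its plan and cope with surprises; the actions let it query knowledge bases or environments for information it does not already have.

Core claim: "interleaved manner" is the key phrase of the whole paper. The model neither finishes thinking before it acts nor finishes acting before it thinks; it alternates at every step.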

We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components.

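In plain terms: the method is called ReAct. Across a varied set of language and decision-making tasks it outperforms state-of-the-art baselines, and it is easier for people to interpret and trust than methods that lack either the reasoning or the acting component.

Note "interpretability and trustworthiness": the claim is not only higher scores but also that people can see what the model is thinking. This is ReAct's main selling point over black-box methods.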

Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces.

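In plain terms: on HotpotQA (question answering) and Fever (fact verification), ReAct calls a simple Wikipedia API and thereby avoids the hallucination and error propagation common in chain-of-thought reasoning. The trajectories it produces resemble how a person would solve the task and are easier to interpret than those of baselines without reasoning traces.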

On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.

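In plain terms: on ALFWorld and WebShop, ReAct is prompted with just one or two in-context examples, yet it beats imitation learning and reinforcement learning methods by 34 and 10 percentage points of absolute success rate, respectively.

"only one or two in-context examples": the contrast is with IL/RL methods that need large amounts of training data. ReAct wins with few-shot prompting alone, which shows how data-efficient the LLM + prompting paradigm is.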

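§1 What it is: where two research lines meet

Before ReAct (October 2022), LLM capabilities were developing along two largely separate lines. ReAct's contribution was to find the point where they cross.

Where the lines converge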

Line A: Reasoning                   Line B: Acting
─────────────────────               ─────────────────────
CoT  (Wei et al. 2022.01)           WebGPT  (OpenAI 2021.12)
Self-Consistency (2022.03)          SayCan  (Google 2022.04)
                                    MRKL    (AI21 2022.05)

            but these two directions have remained separate.

                     ↘            ↙
                      ReAct (2022.10)
             ReAct asks, what if these two
             fundamental capabilities are combined?
                     ↙            ↘
      Today's agent ecosystem        Today's tool-use conventions

1970s

Sense-Plan-Act (SPA): the classical AI loop of sense → plan → act

1998

Sutton & Barto: the agent-environment loop of reinforcement learning (sₜ → aₜ → rₜ₊₁)

2021.12

WebGPT (OpenAI): an early system that let an LLM act in a loop inside a browser

2022.01

Chain-of-Thought (Wei et al.): getting an LLM to "think it through before answering"

2022.10 ⭐

ReAct: unifies Thought + Action + Observation in a single loop

2023.01+

LangChain / AutoGPT / BabyAGI: the engineering boom built on ReAct

2024-now

Claude Code / Cursor / Aider: the loop becomes infrastructure

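Reception

Dimension	Assessment
Academic	ICLR 2023 Oral, 2000+ citations. A common baseline in later agent papers
Engineering	LangChain's first agent type was named `zero-shot-react-description`; the agent loops in Claude Code and Cursor can be read as ReAct variants
Concepts	"Thought / Action / Observation" became shared vocabulary in the AI agent field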

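§2 What it did: four task types, four methods compared

Comparison of the four methods

Method	Thought?	Action?	Characteristics
Standard	✘	✘	Asks directly, answers directly
CoT	✔	✘	Reasons without acting → prone to making up facts (hallucination)
Act-only	✘	✔	Acts without reasoning → searches blindly, with no strategy
ReAct	✔	✔	Interleaves reasoning and acting → has both a strategy and real information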

Five key findings

Finding 1

ReAct > Act-only: Thoughts tell the model what to search for, instead of searching at random. In the abstract's words, "reasoning traces help the model induce, track, and update action plans."

Finding 2

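ReAct > CoT on fact verification: Actions give the model access to real information. In the abstract's words, ReAct "overcomes issues of hallucination and error propagation ... by interacting with a simple Wikipedia API." (On HotpotQA, ReAct alone trails CoT slightly, as the table in §4 shows.)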

Finding 3

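Hallucination drops sharply: in the paper's error analysis on HotpotQA, made-up facts are the leading cause of CoT failures, while ReAct, which can check Wikipedia, nearly eliminates this type of error.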

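Finding 4 (candid)

CoT + Self-Consistency can be stronger at pure reasoning. The paper does not hide its own weak spot: ReAct addresses problems that need external information, not every problem.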

Finding 5

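Combining ReAct with CoT-SC works best: the paper switches between the two with simple back-off rules. It either runs ReAct first and falls back to CoT-SC voting when ReAct fails to return an answer within its step budget, or runs CoT-SC first and falls back to ReAct when the vote is not confident enough.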
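
The fusion idea can be sketched as two back-off rules. As I read the paper, they are ReAct → CoT-SC (use the CoT-SC vote when ReAct returns no answer within its step budget) and CoT-SC → ReAct (use ReAct when the majority answer appears in fewer than half of the CoT-SC samples). The function names below, and the convention that a failed ReAct run is represented by None, are assumptions made for illustration, not the paper's code.

```python
from collections import Counter


def majority_vote(samples):
    """Return the most common answer and how many samples agree with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count


def react_then_cot_sc(react_answer, cot_samples):
    """ReAct -> CoT-SC: keep ReAct's answer unless it produced none."""
    if react_answer is not None:
        return react_answer
    return majority_vote(cot_samples)[0]


def cot_sc_then_react(cot_samples, run_react):
    """CoT-SC -> ReAct: trust the vote only if at least half the samples agree."""
    answer, count = majority_vote(cot_samples)
    if count >= len(cot_samples) / 2:
        return answer
    return run_react()  # hypothetical callable that runs the ReAct loop


print(react_then_cot_sc(None, ["yes", "yes", "no"]))
print(cot_sc_then_react(["a", "b", "c", "d"], lambda: "from ReAct"))
```

Both rules need only a cheap signal to decide when to switch: a missing answer in one direction, a weak majority in the other.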

§3 How it works: The Loop

The five-layer prompt structure

┌────────────────────────────────────────────────────────┐
│ 1. INSTRUCTION (system instruction)                    │
│    "Solve a question answering task with               │
│     interleaving Thought, Action, Observation..."      │
│    Defines three Actions: Search / Lookup / Finish     │
├────────────────────────────────────────────────────────┤
│ 2. Few-shot examples (6 complete examples)             │
│    Each has full Thought→Action→Observation cycles     │
├────────────────────────────────────────────────────────┤
│ 3. Current question                                    │
│    "Question: Were Scott Derrickson and Ed Wood..."    │
├────────────────────────────────────────────────────────┤
│ 4. Reasoning history so far (grows each round)         │
│    Thought 1→Action 1→Observation 1→Thought 2→...      │
├────────────────────────────────────────────────────────┤
│ 5. Prefix for the current round                        │
│    "Thought {i}:" ← the model continues from here      │
│    stop_sequences = ["\nObservation {i}:"]             │
└────────────────────────────────────────────────────────┘
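
The five layers above amount to plain string concatenation. Here is a minimal sketch; `build_prompt`, the separator choices, and the placeholder texts are assumptions for illustration, and the real run_react.py may assemble its prompt differently.

```python
def build_prompt(instruction, examples, question, history, step):
    """Concatenate the five layers into the text sent to the model."""
    layers = [
        instruction.strip(),                         # layer 1: instruction
        "\n\n".join(ex.strip() for ex in examples),  # layer 2: few-shot examples
        f"Question: {question}",                     # layer 3: current question
    ]
    prompt = "\n\n".join(layers) + "\n"
    prompt += "".join(history)                       # layer 4: earlier rounds
    prompt += f"Thought {step}:"                     # layer 5: prefix to continue
    return prompt


# Placeholder texts (hypothetical, shortened for the example).
instruction = "Solve a question answering task with interleaving Thought, Action, Observation steps."
examples = ["Question: ...\nThought 1: ...\nAction 1: Finish[...]"]
history = [
    "Thought 1: I need to search Scott Derrickson.\n"
    "Action 1: Search[Scott Derrickson]\n"
    "Observation 1: Scott Derrickson is an American filmmaker.\n"
]
prompt = build_prompt(
    instruction, examples,
    "Were Scott Derrickson and Ed Wood of the same nationality?",
    history, step=2,
)
stop = ["\nObservation 2:"]  # paired with the "Thought 2:" prefix
print(prompt)
```

Only layers 4 and 5 change from round to round; layers 1 to 3 stay fixed for the whole question.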

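The loop mechanism, visualized

Step 1 · Thought: the model reasons about the current situation and plans the next move.
Step 2 · Action: the model decides which tool to call (Search / Lookup / Finish).
Step 3 · Observation: the program executes the Action and returns the real result (not something the model made up).
Step 4 · Append & Loop: this round's output is appended to the prompt and the next round begins.

↩ The loop continues until the model outputs Finish[answer] or reaches max_steps (8 rounds by default).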

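Thought (reasoning): the model writes out its current thinking in natural language.

Corresponding code: llm(prompt + f"Thought {i}:", stop=[f"\nObservation {i}:"])
The model sees the full prompt (instruction + few-shot examples + history) and continues from "Thought {i}:". The stop sequence cuts generation off just before the model would write the Observation, so it never gets to invent a search result.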
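
Putting the four steps together, the whole loop fits in a few lines. This is a sketch under stated assumptions: `react_loop`, `scripted_llm`, and the toy Search tool are hypothetical stand-ins, with the scripted model replacing a real API call so the example runs offline.

```python
import re

# Matches e.g. "Action 2: Search[Ed Wood]" and captures ("Search", "Ed Wood").
ACTION_RE = re.compile(r"Action \d+: (\w+)\[(.*)\]")


def react_loop(llm, tools, base_prompt, max_steps=8):
    """Run Thought -> Action -> Observation rounds until Finish or max_steps."""
    prompt = base_prompt
    for i in range(1, max_steps + 1):
        # Steps 1-2: the model writes Thought i and Action i, then is cut off.
        output = llm(prompt + f"Thought {i}:", stop=[f"\nObservation {i}:"])
        match = ACTION_RE.search(output)
        if match is None:
            observation = "Could not parse an action."
        else:
            name, arg = match.groups()
            if name == "Finish":
                return arg, i
            tool = tools.get(name)
            # Step 3: the program, not the model, produces the Observation.
            observation = tool(arg) if tool else f"Unknown action {name}."
        # Step 4: append this round to the prompt and loop.
        prompt += f"Thought {i}:{output}\nObservation {i}: {observation}\n"
    return None, max_steps


def scripted_llm(prompt, stop):
    """Hypothetical stand-in for a completion API: replies by step number."""
    step = prompt.rsplit("Thought ", 1)[1].rstrip(":")
    replies = {
        "1": " I should look up Scott Derrickson.\nAction 1: Search[Scott Derrickson]",
        "2": " He is American, like Ed Wood.\nAction 2: Finish[yes]",
    }
    return replies[step]


tools = {"Search": lambda query: f"{query} is an American filmmaker."}
answer, steps = react_loop(scripted_llm, tools, "Question: ...\n")
print(answer, steps)
```

The loop returns None when the step budget runs out without a Finish action, which gives the caller a clear signal to handle.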

stop_sequences: the boundary of control

  run_react.py (your machine)            LLM API (remote server)
  ═══════════════════════                ══════════════════════

  Build prompt + "Thought 1:"
              ── HTTP request ──►
              stop = ["\nObservation 1:"]
                                         Model generates token by token...
                                         "...Action 1: Search[X]"
                                         "\nObservation 1:" ← stop hit!
                                         ⛔ Generation halted
              ◄── HTTP response ──
              Returns Thought + Action

  Parse the Action
  → call Wikipedia → get the real result
  → append it to the prompt as the Observation
  → send "Thought 2:" back to the model
              ── HTTP request ──►
              ...loop continues...

Key Insight

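Because the few-shot examples show complete Thought → Action → Observation cycles, the model is fully capable of writing an Observation on its own, which means inventing one. stop_sequences is the interception point: the few-shot examples teach the model the format, and stop_sequences hands control of the content back to the program. The same idea has since evolved into tool_use APIs (the structured tool-calling interfaces from Anthropic and OpenAI).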
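
To see what the stop sequence buys, it helps to emulate it on a finished string. Real APIs halt generation server-side; the hypothetical `apply_stop` helper below only mimics the visible effect, which is that everything from the first stop string onward is discarded.

```python
def apply_stop(text, stop):
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for marker in stop:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# What an unconstrained model might write: it goes on to invent an Observation.
raw = (
    " I need to search X.\nAction 1: Search[X]"
    "\nObservation 1: X is a made-up fact.\nThought 2: ..."
)
kept = apply_stop(raw, ["\nObservation 1:"])
print(repr(kept))
```

What survives is exactly the Thought and the Action; the invented Observation never reaches the program.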

Full example: HotpotQA idx=0

"Were Scott Derrickson and Ed Wood of the same nationality?"

Thought 1: I need to search Scott Derrickson and Ed Wood, find their nationalities, then determine if they are the same.

Action 1: Search[Scott Derrickson]

Observation 1: Scott Derrickson (born July 16, 1966) is an American filmmaker. He is known for his work in the horror genre...

→ The model lays out the full strategy (search both people → compare nationalities) and starts with the first person.

Thought 2: Scott Derrickson is an American filmmaker. I need to search Ed Wood next and find his nationality.

Action 2: Search[Ed Wood]

Observation 2: Edward Davis Wood Jr. (October 10, 1924 – December 10, 1978) was an American filmmaker, actor and novelist...

→ Based on the result of round 1, the model knows it still needs to search for the second person.

Thought 3: Ed Wood is also an American filmmaker. So Scott Derrickson and Ed Wood are of the same nationality.

Action 3: Finish[yes]

Result: EM = 1 ✔ F1 = 1.00 ✔ 3 rounds, 0 parse failures

→ The two Observations supplied the real nationalities, and the model reached the correct conclusion.
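
The EM and F1 figures in the result line can be computed with SQuAD-style answer scoring. The sketch below is a simplified version written for illustration; it assumes the gold answer is "yes" (as EM = 1 implies), and the scoring code in run_react.py or the official HotpotQA evaluator may differ in details.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("yes", "yes"), f1_score("yes", "yes"))
```

EM is all-or-nothing, while F1 gives partial credit when the predicted and gold answers share tokens.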

§4 Results overview

Task	Metric	CoT	Act-only	ReAct	ReAct + CoT-SC
HotpotQA	EM (%)	29.4	25.7	27.4	35.1
FEVER	Accuracy (%)	56.3	58.9	60.9	64.6
ALFWorld	Success rate (%)	n/a	45	71	n/a
WebShop	Success rate (%)	n/a	30.1	40.0	n/a

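Figures are from Tables 1-4 of the paper (PaLM-540B results). Note that on HotpotQA, ReAct alone (27.4) scores slightly below CoT (29.4), while the ReAct + CoT-SC combination (35.1) leads by a clear margin. The combined column gives the better of the paper's two back-off orders for each task. This reflects how the two capabilities complement each other.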

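Gains with modern models

The PaLM-540B and text-davinci-002 models used in the paper are a previous generation. Running this project's reproduction code (run_react.py --n 50) with Claude Haiku 4.5, HotpotQA EM typically reaches 40% or more, well above the paper's numbers. The models have grown stronger, but ReAct's loop architecture has not gone out of date.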

§V Vocabulary tracker

Academic English expressions collected while reading the paper. Updated continuously.

English	Chinese	Context in the paper
synergizing	协同 / 使产生协同效应	Title "Synergizing Reasoning and Acting": making reasoning and acting add up to more than the sum of their parts
reasoning traces	推理轨迹	The thinking process the model writes out in natural language, i.e. the Thought part
interleaved manner	交替方式	"generate both reasoning traces and actions in an interleaved manner"
greater synergy	更大的协同效应	"allowing for greater synergy between the two"
impressive capabilities	令人印象深刻的能力	"LLMs have demonstrated impressive capabilities"
have primarily been studied as separate topics	主要被作为独立课题研究	The "separation" problem the paper defines in its opening sentence
induce, track, and update	归纳、追踪、更新	"reasoning traces help the model induce, track, and update action plans"
handle exceptions	处理异常	"as well as handle exceptions": reasoning lets the model cope with unexpected situations
interface with external sources	与外部信息源交互	"actions allow it to interface with external sources such as knowledge bases"
diverse set of	多样的…集合	"a diverse set of language and decision making tasks"
state-of-the-art baselines	当前最优基线	"demonstrate its effectiveness over state-of-the-art baselines"
interpretability / interpretable	可解释性 / 可解释的	"improved human interpretability and trustworthiness"
trustworthiness	可信度	People can follow the model's reasoning, so they place more trust in its results
hallucination	幻觉（编造事实）	"overcomes issues of hallucination and error propagation"
error propagation	错误传播	An early reasoning error causes the steps that follow to go wrong
outperforms	超越 / 优于	"ReAct outperforms imitation and reinforcement learning methods"
imitation (learning)	模仿学习	Learns a policy from human demonstrations; needs a large amount of annotated data
reinforcement (learning)	强化学习	Learns a policy through trial and error with reward signals
in-context examples	上下文示例（few-shot）	"prompted with only one or two in-context examples"
we apply our approach	我们将此方法应用于	A standard academic phrase for describing where a method is applied
demonstrate its effectiveness	证明其有效性	A standard academic phrase for describing experimental validation

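§T Key points for beginners

At the paper level: remember three things

1 / Interleaving is the key word

The model does not finish thinking before it acts; it alternates at every step. Each Thought reasons from the previous round's Observation, and each Action is decided by the current Thought. This interleaving lets the model revise its strategy as new information arrives.

2 / stop_sequences is the boundary of control

Rather than asking the model in the prompt not to make things up, the API parameter stop_sequences cuts generation off by force. The program decides when the model writes and when the program executes. Today's tool_use APIs are a structured evolution of the same idea.

3 / No silver bullet

The paper candidly reports the settings where CoT (SC) is stronger at pure reasoning. ReAct targets problems that need external information plus multi-step decisions. The paper even proposes a ReAct + CoT-SC combination strategy. A good paper does not hide its own limits.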

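At the code level: run three experiments

Experiment 1: run a single question and observe

cd code/react-hands-on && python run_react.py --idx 0
Watch the complete alternating Thought → Action → Observation output. Pay particular attention to what the model plans in its Thoughts.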

Experiment 2: ablation, remove Thought

Modify INSTRUCTION to switch to an Act-only prompt, then run 50 questions and compare. This reproduces the core experiment of Figure 2 in the paper.
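
One way to obtain Act-only few-shot examples for this ablation is to delete the Thought lines from the ReAct examples. The helper below is a hypothetical sketch that assumes every Thought occupies a single line starting with "Thought "; the loop's prefix would also need to change from "Thought {i}:" to "Action {i}:".

```python
def strip_thoughts(example):
    """Drop every line that starts with 'Thought ' from a few-shot example."""
    kept = [line for line in example.splitlines()
            if not line.startswith("Thought ")]
    return "\n".join(kept)


react_example = (
    "Question: Were Scott Derrickson and Ed Wood of the same nationality?\n"
    "Thought 1: I need to search Scott Derrickson and Ed Wood.\n"
    "Action 1: Search[Scott Derrickson]\n"
    "Observation 1: Scott Derrickson (born July 16, 1966) is an American filmmaker.\n"
    "Thought 2: I need to search Ed Wood next.\n"
    "Action 2: Search[Ed Wood]"
)
act_only_example = strip_thoughts(react_example)
print(act_only_example)
```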

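Experiment 3: change tool granularity

Merge Search and Lookup into a single tool and observe how loop length and score change. The aim is to understand how tool granularity affects agent behavior.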
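
To make the granularity experiment concrete, here is a toy in-memory sketch of the two tools and one possible merged form. `ToyWiki`, its page data, and `search_lookup` are all hypothetical; the real tools call Wikipedia, and the behavior shown (Search opens a page and returns its opening sentences, Lookup steps through sentences containing a keyword) only approximates them.

```python
class ToyWiki:
    """In-memory stand-in for the Wikipedia tools (for illustration only)."""

    def __init__(self, pages):
        self.pages = pages      # title -> list of sentences
        self.page = []          # sentences of the page opened last
        self.keyword = None     # keyword of the lookup in progress
        self.hits = []
        self.cursor = 0

    def search(self, entity):
        """Open a page and return its first two sentences."""
        self.page = self.pages.get(entity, [])
        self.keyword, self.hits, self.cursor = None, [], 0
        if not self.page:
            return f"Could not find {entity}."
        return " ".join(self.page[:2])

    def lookup(self, keyword):
        """Return the next sentence on the open page that contains keyword."""
        if keyword != self.keyword:
            self.keyword = keyword
            self.hits = [s for s in self.page if keyword.lower() in s.lower()]
            self.cursor = 0
        if self.cursor >= len(self.hits):
            return "No more results."
        self.cursor += 1
        return f"(Result {self.cursor} / {len(self.hits)}) {self.hits[self.cursor - 1]}"

    def search_lookup(self, entity, keyword):
        """Merged tool: open the page and jump straight to the keyword."""
        self.search(entity)
        return self.lookup(keyword)


wiki = ToyWiki({"Ed Wood": [
    "Ed Wood was an American filmmaker.",
    "He directed Plan 9 from Outer Space.",
    "Wood died in 1978.",
]})
print(wiki.search("Ed Wood"))
print(wiki.lookup("wood"))
print(wiki.search_lookup("Ed Wood", "Plan 9"))
```

The merged tool saves one round per question but requires the model to commit to a keyword before it has seen the page, which is the trade-off the experiment is meant to expose.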

Mental models

Mental model 1: Agent = Loop + Tools + LLM

  ReAct paper       run_react.py              Today's Claude Code
  ──────────        ──────────────            ──────────────────
  Loop              for i in range(1,9)       while True: ...
  Tools             Wikipedia Search/Lookup   Read/Write/Bash/Grep
  LLM               text-davinci-002          Claude Opus 4.6

  The skeleton is unchanged. What changed is the scale.

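Mental model 2: the prompt is the program, stop_sequences is the interrupt

  INSTRUCTION                  = function signature (defines the I/O format)
  Few-shot examples            = unit tests (demonstrate the expected behavior)
  Current question + history   = runtime state
  stop_sequences               = interrupt (takes control back at a specific point)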