Harness in Practice: Context Compression, a Three-Layer Strategy That Keeps an Agent Going

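The longer an agent works, the fatter messages gets. A 1,000-line cat output takes about 4,000 tokens; the model has long since read it, yet every later round still pays for it. A three-layer compression strategy keeps the context under control: micro_compact silently replaces old results, auto_compact summarizes with an LLM once the token count crosses a threshold, and a compact tool lets the model trigger compression itself.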

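Before we start

In the previous post we gave the agent background tasks (fire-and-forget): the agent submits long-running work, returns immediately, and keeps talking to the user.

By now, though, a problem that has been there all along can no longer be ignored: the context keeps growing.

Every tool call adds another tool_result to messages[]. Reading a 1,000-line file costs about 4,000 tokens; one pytest run produces about 2,000. The model has already seen these results and responded to them, yet on every later API call they are still sitting in messages, billed again and taking up space again.

The problem: why keep carrying what the model has already seen?

A typical agent task: "Check this project's test coverage, then fix the failing tests."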

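Round 1: read_file → main.py            (output: 200 lines, ~800 tokens)
Round 2: bash → pytest --cov            (output: 80 lines,  ~400 tokens)
Round 3: read_file → test_main.py       (output: 150 lines, ~600 tokens)
Round 4: edit_file → test_main.py       (output: "Edited", ~10 tokens)
Round 5: bash → pytest                  (output: 30 lines,  ~150 tokens)
Round 6: tell the user "fixed"

By round 6, messages still carries the full 200 lines of main.py from round 1. The model read them right after round 1 and moved on to other work in round 2, yet you pay for those 800 tokens on every round.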

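Over 6 rounds, the accumulated waste:

Round 1 result resent in the 5 later rounds: 800 × 5 = 4,000 tokens
Round 2 result resent in the 4 later rounds: 400 × 4 = 1,600 tokens
Round 3 result resent in the 3 later rounds: 600 × 3 = 1,800 tokens
───────────────────────────────────
Wasted tokens resent in just 6 rounds:       ~7,400 tokens

And this is a small task. Because every result is resent on every later round, the waste grows roughly quadratically with the number of rounds; in a 30-round refactoring task it easily exceeds 100,000 tokens. You keep paying for information the model has already digested.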
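The arithmetic above can be reproduced in a few lines. This is an illustrative sketch only: `resend_cost` is a hypothetical helper, and the 500-token average used for the 30-round case is an assumption, not a measurement.

```python
def resend_cost(result_tokens, total_rounds):
    """Tokens spent re-sending each round's tool_result on every later round.

    result_tokens[i] is the size of the result produced in round i + 1.
    """
    return sum(
        size * (total_rounds - round_no)
        for round_no, size in enumerate(result_tokens, start=1)
    )


# The three large results from the 6-round example above.
small_task = resend_cost([800, 400, 600], total_rounds=6)

# Assumed: a 30-round task averaging 500 tokens of tool output per round.
long_task = resend_cost([500] * 30, total_rounds=30)

print(small_task, long_task)
```

Under these assumptions the 6-round task resends 7,400 tokens and the 30-round task about 217,500, which is where the "over 100,000" estimate comes from.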

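Core insight: the model only needs to see a tool_result once

The model calls a tool → gets the tool_result → produces an assistant response.

That assistant response is the evidence that the model has "digested" the tool_result. It has read the 200 lines of main.py, formed an impression, and made a decision.

Sending the original text again on the next round gives the model no new information.

So the strategy is simple: once the model has seen a tool_result, replace the old one with a placeholder.

Before: {"type": "tool_result", "content": "full contents of the 200-line main.py..."}
After:  {"type": "tool_result", "content": "[Previous: used read_file]"}

800 tokens → 10 tokens, a saving of about 99%.

What if the model really does need to look at the file again? It can call read_file again. Reading a file is idempotent, and one extra read costs far less than carrying the file on every round. (The final implementation makes an exception for read_file itself; see PRESERVE_RESULT_TOOLS in the configuration section.)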

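Three-layer compression architecture

Placeholder replacement alone goes a long way, but in a truly long conversation the messages themselves pile up: the overhead of dozens of messages (assistant replies, placeholders, user input) adds up too.

So we need three progressive layers: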

┌──────────────────────────────────────────────────────┐
│                      agent_loop                      │
│                                                      │
│  Every round → Layer 1: micro_compact (silent)       │
│      replace old tool_result with placeholders       │
│      cost: zero (pure string operations)             │
│                       ↓                              │
│  tokens > threshold → Layer 2: auto_compact (LLM)    │
│      back up full conversation → LLM summary         │
│      cost: one LLM call                              │
│                       ↓                              │
│  Model calls it → Layer 3: compact tool (manual)     │
│      same summary mechanism as Layer 2, on demand    │
│                                                      │
│  Aggressiveness:  low ───────────────────────→ high  │
└──────────────────────────────────────────────────────┘

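Layer	Trigger	Cost	Information loss
micro_compact	automatic, every round	zero	almost none (the placeholder keeps "which tool was used")
auto_compact	token count over threshold	one LLM call	details compressed into a summary; conversation backed up to disk
compact tool	model-initiated / user-initiated	same as above	same as auto_compact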

Configuration

Add the compression settings to the global-variable section of agentic-demo.py:

import time  # new import: transcript filenames need a timestamp

# ── Context compression settings ────────────────────────
THRESHOLD = 50_000               # estimated tokens above which auto_compact triggers
TRANSCRIPT_DIR = WORKDIR / ".transcripts"
KEEP_RECENT = 3                  # micro_compact keeps the N most recent consumed tool_results
PRESERVE_RESULT_TOOLS = {"read_file"}  # results of these tools are never compacted (reference material)
SUMMARY_SYSTEM = (               # system prompt for the auto_compact summary call
    "Summarize this conversation for continuity. Include: "
    "1) What was accomplished, 2) Current state, 3) Key decisions made. "
    "Be concise but preserve critical details."
)

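Each of the five settings handles one thing:

Variable	Default	Purpose
`THRESHOLD`	50,000	estimated token count above which auto_compact triggers
`TRANSCRIPT_DIR`	`.transcripts/`	directory for conversation backups
`KEEP_RECENT`	3	micro_compact keeps the full content of the N most recent consumed tool_results
`PRESERVE_RESULT_TOOLS`	`{"read_file"}`	results of these tools are never replaced
`SUMMARY_SYSTEM`	(see above)	system prompt used when auto_compact asks the LLM for a summary

PRESERVE_RESULT_TOOLS is a key design decision. read_file results are reference material. If one is replaced with a placeholder, the model is forced to call read_file again; that new tool_result is later replaced as well, and the model can end up in a read, replace, reread loop. Preserving read_file results keeps reference material out of compression.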

Also add one sentence to the end of the SYSTEM prompt so the model knows it can compress on its own initiative:

SYSTEM = f"""...(existing content unchanged)

When context feels long or cluttered, use the `compact` tool to compress the conversation."""

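Token estimation

def estimate_tokens(messages: list) -> int:
    """Rough token estimate: about 4 characters per token."""
    return len(str(messages)) // 4

It does not need to be precise: being off by 20% does not matter, because the only goal is to decide whether it is time to compress. An exact count requires a tokenizer, which means an extra dependency, and that is not worth it here.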
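One caveat worth sketching: the 4-characters-per-token rule is tuned for English text and code. CJK text usually tokenizes much more densely, so a conversation held mostly in Chinese would be underestimated. The variant below is a hypothetical adjustment, not part of the original script: `estimate_tokens_mixed` and its "one token per non-ASCII character" weight are assumptions chosen only to err on the safe side.

```python
def estimate_tokens(messages: list) -> int:
    """The article's estimate: about 4 characters per token."""
    return len(str(messages)) // 4


def estimate_tokens_mixed(messages: list) -> int:
    """Assumed variant: count each non-ASCII character as one token."""
    text = str(messages)
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    ascii_chars = len(text) - non_ascii
    return ascii_chars // 4 + non_ascii


english = [{"role": "user", "content": "a" * 4000}]
chinese = [{"role": "user", "content": "测试" * 100}]

print(estimate_tokens(english), estimate_tokens_mixed(english))
print(estimate_tokens(chinese), estimate_tokens_mixed(chinese))
```

For pure ASCII input both functions agree; for the Chinese sample the mixed estimate is several times higher, which would make auto_compact trigger earlier.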

Layer 1: micro_compact, silently replacing old tool_results

Implementation

def micro_compact(messages: list) -> list:
    # Find the last assistant message: only results the model has already "digested" may be replaced
    last_assistant_idx = -1
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "assistant":
            last_assistant_idx = i
            break

    # Collect only the tool_results before last_assistant_idx (the ones the model has seen)
    consumed_results = []
    for msg_idx, msg in enumerate(messages):
        if msg_idx >= last_assistant_idx:
            break
        if msg["role"] == "user" and isinstance(msg.get("content"), list):
            for part_idx, part in enumerate(msg["content"]):
                if isinstance(part, dict) and part.get("type") == "tool_result":
                    consumed_results.append((msg_idx, part_idx, part))
    if len(consumed_results) <= KEEP_RECENT:
        return messages

    # Build a tool_use_id → tool_name map from the assistant messages
    tool_name_map = {}
    for msg in messages:
        if msg["role"] == "assistant":
            content = msg.get("content", [])
            if isinstance(content, list):
                for block in content:
                    if hasattr(block, "type") and block.type == "tool_use":
                        tool_name_map[block.id] = block.name

    # Replace the older consumed results (keep the last KEEP_RECENT)
    to_clear = consumed_results[:-KEEP_RECENT]
    for _, _, result in to_clear:
        if not isinstance(result.get("content"), str) or len(result["content"]) <= 100:
            continue
        tool_id = result.get("tool_use_id", "")
        tool_name = tool_name_map.get(tool_id, "unknown")
        if tool_name in PRESERVE_RESULT_TOOLS:
            continue
        result["content"] = f"[Previous: used {tool_name}]"
    return messages

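Four safeguards

The implementation has four safeguards that keep over-compression from causing loops or information loss:

1. Only replace results the model has already "digested"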

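last_assistant_idx = ...  # position of the last assistant message
for msg_idx, msg in enumerate(messages):
    if msg_idx >= last_assistant_idx:
        break  # the model has not seen tool_results past this point; leave them alone

Suppose the model calls 5 tools at once and all 5 tool_results are appended to the end of messages. When micro_compact runs on the next round, no assistant message follows those 5 results yet, which means the model has not seen them. They are left untouched.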

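2. Among consumed results, keep the last 3, counted per entry

consumed_results = [...]  # only tool_results the model has already seen
to_clear = consumed_results[:-KEEP_RECENT]

The count is per tool_result entry, not per assistant turn. One turn can carry several tool_results (the model called several tools at once), so counting entries is more precise.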

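3. Skip short content (≤100 characters)

if not isinstance(result.get("content"), str) or len(result["content"]) <= 100:
    continue

"Edited test_main.py" is only 19 characters. Replacing it with a placeholder saves almost nothing and throws away useful information. Only large outputs that really take up space are replaced.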

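4. Never replace read_file results

if tool_name in PRESERVE_RESULT_TOOLS:
    continue

read_file output is reference material: the model may refer back to the file contents over many later rounds. If it were replaced, the model would have to call read_file again, producing a new large output that gets replaced in turn, then read again, and so on in a loop.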

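Looking up tool_name

Note how tool_name_map is built: it scans the assistant messages for tool_use blocks and maps block.id to block.name.

Why not store tool_name directly in the tool_result? Because the tool_result format in the Anthropic API carries only tool_use_id, not the tool name. The name has to be looked up from the matching tool_use block in the assistant message.

Also note the hasattr(block, "type") check: the Anthropic SDK returns objects (with .type and .name attributes), not dicts.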
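One consequence of the hasattr check: if assistant blocks ever arrive as plain dicts (for example after a history is reloaded from JSON), the lookup in the listing silently yields "unknown", and a read_file result would then lose its PRESERVE_RESULT_TOOLS protection. A minimal sketch of a lookup that tolerates both shapes follows; `block_field` and `build_tool_name_map` are hypothetical helpers, and `SimpleNamespace` merely stands in for an SDK block object.

```python
from types import SimpleNamespace


def block_field(block, name, default=None):
    """Read a field from either a dict block or an attribute-style block."""
    if isinstance(block, dict):
        return block.get(name, default)
    return getattr(block, name, default)


def build_tool_name_map(messages):
    """Map tool_use_id → tool name across all assistant messages."""
    names = {}
    for msg in messages:
        content = msg.get("content")
        if msg["role"] != "assistant" or not isinstance(content, list):
            continue
        for block in content:
            if block_field(block, "type") == "tool_use":
                names[block_field(block, "id")] = block_field(block, "name")
    return names


sdk_style = {
    "role": "assistant",
    "content": [SimpleNamespace(type="tool_use", id="t1", name="bash")],
}
dict_style = {
    "role": "assistant",
    "content": [{"type": "tool_use", "id": "t2", "name": "read_file"}],
}
name_map = build_tool_name_map([sdk_style, dict_style])
print(name_map)
```

Whether this is needed depends on how the history is stored; within a single process that only appends response.content, the original check is sufficient.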

Execution trace

State of messages (before the 5th LLM call):

  [0] user: "Check the test coverage"
  [1] assistant: → tool_use(read_file, "main.py")
  [2] user: tool_result("200 lines of code...")
  [3] assistant: → tool_use(bash, "pytest --cov")
  [4] user: tool_result("coverage report...")
  [5] assistant: → tool_use(read_file, "test_main.py")
  [6] user: tool_result("150 lines of test code...")
  [7] assistant: → tool_use(edit_file, "test_main.py")
  [8] user: tool_result("Edited test_main.py")      ← not consumed yet (and ≤100 characters)

micro_compact:
  last_assistant_idx = 7  (messages[7] is the last assistant message)
  consumed tool_results (index < 7): [2], [4], [6]
  [8] comes after last_assistant_idx → the model has not seen it yet, leave it alone
  len(consumed_results) = 3, KEEP_RECENT = 3 → 3 <= 3, return early, nothing is replaced

  One round later (one more tool call; consumed = [2], [4], [6], [8]):
    to_clear = [2]
    [2] "200 lines of code..."  → kept (read_file is in PRESERVE_RESULT_TOOLS)

  Two rounds later (consumed has 5 entries):
    to_clear = [2], [4]
    [2] "200 lines of code..."  → kept (read_file is in PRESERVE_RESULT_TOOLS)
    [4] "coverage report..."    → "[Previous: used bash]"   (400 → 10 tokens)

Key point: only results the model has consumed are replaced. If the model calls 5 tools at once, all 5 results are fully visible on the next round, because no assistant message follows them yet.
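The walk-through can be checked mechanically. The sketch below uses a condensed copy of micro_compact (same rules, shorter syntax) and fake messages; `SimpleNamespace` stands in for SDK tool_use blocks, and the `call`/`result` helpers plus the payload sizes are assumptions made up for the trace.

```python
from types import SimpleNamespace

KEEP_RECENT = 3
PRESERVE_RESULT_TOOLS = {"read_file"}


def micro_compact(messages):
    """Condensed copy of the article's micro_compact, for tracing only."""
    last_assistant_idx = max(
        (i for i, m in enumerate(messages) if m["role"] == "assistant"), default=-1
    )
    consumed = [
        part
        for m in messages[: max(last_assistant_idx, 0)]
        if m["role"] == "user" and isinstance(m.get("content"), list)
        for part in m["content"]
        if isinstance(part, dict) and part.get("type") == "tool_result"
    ]
    if len(consumed) <= KEEP_RECENT:
        return messages
    names = {
        b.id: b.name
        for m in messages
        if m["role"] == "assistant" and isinstance(m.get("content"), list)
        for b in m["content"]
        if getattr(b, "type", None) == "tool_use"
    }
    for res in consumed[:-KEEP_RECENT]:
        content = res.get("content")
        if not isinstance(content, str) or len(content) <= 100:
            continue
        name = names.get(res.get("tool_use_id", ""), "unknown")
        if name not in PRESERVE_RESULT_TOOLS:
            res["content"] = f"[Previous: used {name}]"
    return messages


def call(i, tool):
    block = SimpleNamespace(type="tool_use", id=f"t{i}", name=tool)
    return {"role": "assistant", "content": [block]}


def result(i, text):
    part = {"type": "tool_result", "tool_use_id": f"t{i}", "content": text}
    return {"role": "user", "content": [part]}


messages = [
    {"role": "user", "content": "check test coverage"},
    call(1, "read_file"), result(1, "x" * 3200),
    call(2, "bash"), result(2, "y" * 1600),
    call(3, "read_file"), result(3, "z" * 2400),
    call(4, "edit_file"), result(4, "Edited test_main.py"),
]
micro_compact(messages)  # 3 consumed results: early return
bash_now = messages[4]["content"][0]["content"]

messages += [call(5, "bash"), result(5, "p" * 600)]
micro_compact(messages)  # 4 consumed: only the read_file result is a candidate
bash_one_round_later = messages[4]["content"][0]["content"]

messages += [call(6, "bash"), result(6, "q" * 600)]
micro_compact(messages)  # 5 consumed: the bash result is finally replaced
bash_two_rounds_later = messages[4]["content"][0]["content"]
main_py = messages[2]["content"][0]["content"]
```

The coverage report survives two calls untouched and is replaced on the third, while the main.py contents stay intact throughout.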

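Layer 2: auto_compact, LLM summarization

Trigger condition

micro_compact slows the growth, but with enough rounds the overhead of the messages themselves (assistant reply text, placeholders, user input) keeps accumulating. Once the token count exceeds the threshold, something more aggressive is needed.

Implementation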

def auto_compact(messages: list) -> list:
    """Save the conversation to disk, summarize it with the LLM, and replace all messages."""
    # ① Save the transcript
    TRANSCRIPT_DIR.mkdir(exist_ok=True)
    transcript_path = TRANSCRIPT_DIR / f"transcript_{int(time.time())}.jsonl"
    with open(transcript_path, "w") as f:
        for msg in messages:
            f.write(json.dumps(msg, default=str) + "\n")
    print(f"[transcript saved: {transcript_path}]")

    # ② Summarize with the LLM (keep only the last 80,000 characters to bound the input)
    conversation_text = json.dumps(messages, default=str)[-80000:]
    response = client.messages.create(
        model=MODEL,
        system=SUMMARY_SYSTEM,
        messages=[{"role": "user", "content": conversation_text}],
        max_tokens=2048,
    )
    summary = next(
        (block.text for block in response.content if hasattr(block, "text")),
        "No summary generated.",
    )

    # ③ Replace the whole history with a single compressed message
    return [
        {
            "role": "user",
            "content": (
                f"[Conversation compressed. Transcript: {transcript_path}]\n\n"
                f"{summary}"
            ),
        },
    ]

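Three steps:

① Back up: write messages to a JSONL file, one message per line. default=str handles serialization of SDK objects. Compression is lossy, but the conversation stays on disk. Note that the transcript records messages as they are at that moment, so results micro_compact has already replaced appear as placeholders.

② Summarize: serialize messages and keep the last 80,000 characters so the summary input itself cannot grow too long. With THRESHOLD at 50,000 estimated tokens (about 200,000 characters), that tail is roughly the last 40% of the conversation; earlier content reaches the summary only if it was restated later, though it remains in the transcript. The summary instruction is passed as system=SUMMARY_SYSTEM (a system-level instruction) and the conversation as the user message, which separates the two roles cleanly. The summary keeps the goal, the progress, and the key decisions, which is what continuing the work actually requires.

③ Replace: return a list with a single message that also carries the backup file path. A conversation of tens of thousands of tokens becomes one summary message (1,000 to 2,000 tokens), a compression ratio of 10:1 to 50:1.

Why not keep the most recent messages?

Some implementations keep the last N messages alongside the summary. Ours is simpler: everything is replaced by one summary. The reason is that micro_compact already protects the most recent tool_results continuously, and auto_compact triggers only when the token count is truly over the limit. By then the whole conversation is due for summarization.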
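The backup-and-replace logic can be exercised without an API key if the LLM call is passed in as a function. The sketch below is a hypothetical refactoring, not the article's code: `compact_with`, its `summarize` parameter, and `fake_summarize` are assumptions introduced for testing.

```python
import json
import tempfile
import time
from pathlib import Path


def compact_with(messages, summarize, transcript_dir):
    """Back up messages as JSONL, summarize the tail, return (new_messages, path)."""
    transcript_dir.mkdir(parents=True, exist_ok=True)
    path = transcript_dir / f"transcript_{int(time.time())}.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for msg in messages:
            f.write(json.dumps(msg, default=str) + "\n")

    # Same truncation as the article: only the last 80,000 characters are summarized.
    tail = json.dumps(messages, default=str)[-80000:]
    summary = summarize(tail)

    compacted = [{
        "role": "user",
        "content": f"[Conversation compressed. Transcript: {path}]\n\n{summary}",
    }]
    return compacted, path


def fake_summarize(text):
    """Stand-in for the LLM call so the sketch runs offline."""
    return f"(summary of {len(text)} characters)"


history = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "hi, how can I help?"},
]
with tempfile.TemporaryDirectory() as tmp:
    compacted, path = compact_with(history, fake_summarize, Path(tmp) / ".transcripts")
    saved_lines = path.read_text(encoding="utf-8").splitlines()
```

One thing this makes visible: the filename uses a timestamp in whole seconds, so two compactions within the same second would write to the same file.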

Layer 3: the compact tool, triggered by the model

Tool definition

COMPACT_TOOL = {
    "name": "compact",
    "description": "Trigger manual conversation compression.",
    "input_schema": {
        "type": "object",
        "properties": {
            "focus": {"type": "string", "description": "What to preserve in the summary"}
        },
    },
}

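The focus parameter lets the model say what the compression should prioritize, for example "keep the discussion about the database migration". The current implementation does not use focus (the summary relies on the generic prompt), but the extension point is there.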
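If focus were to be wired in, one plausible route is to fold it into the summary prompt. This is a sketch under assumptions: `build_summary_system` is a hypothetical helper, and the wording appended to the prompt is invented for illustration.

```python
SUMMARY_SYSTEM = (
    "Summarize this conversation for continuity. Include: "
    "1) What was accomplished, 2) Current state, 3) Key decisions made. "
    "Be concise but preserve critical details."
)


def build_summary_system(focus=None):
    """Return the summary prompt, extended with the model's focus hint if given."""
    focus = (focus or "").strip()
    if not focus:
        return SUMMARY_SYSTEM
    return f"{SUMMARY_SYSTEM} Pay particular attention to preserving: {focus}"


prompt = build_summary_system("the database migration discussion")
print(prompt)
```

To use it, auto_compact would need an extra parameter, and the loop would have to pass along the focus value from the compact tool call's input; neither change is in the article's code.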

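Special handling in agent_loop

The compact tool does not go through the generic TOOL_HANDLERS dispatch. The summary has to run after every tool has executed and after the tool_results have been appended, so a flag marks the request: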

results = []
manual_compact = False
for block in response.content:
    if block.type == "tool_use":
        if block.name == "compact":
            manual_compact = True
            output = "Compressing..."
        else:
            handler = TOOL_HANDLERS.get(block.name)
            try:
                output = handler(**block.input) if handler else f"Unknown tool: {block.name}"
            except Exception as e:
                output = f"Error: {e}"
        results.append({"type": "tool_result", "tool_use_id": block.id,
                        "content": str(output)})
messages.append({"role": "user", "content": results})

# Layer 3: once compact was requested, summarize immediately and return
if manual_compact:
    print("[manual compact]")
    messages[:] = auto_compact(messages)
    return

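Key detail: first append the tool_results (including the placeholder result for compact) to messages, then summarize, and finally return to leave agent_loop. The return is there because only one message remains after the summary, and the model needs a fresh instruction from the user before it can continue.

Integration: where the three layers sit in the loop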

def agent_loop(messages: list):
    while True:
        # ★ Layer 1: micro_compact, silently replace old tool_results every round
        micro_compact(messages)

        # ★ Layer 2: auto_compact, LLM summary once the token estimate exceeds the threshold
        if estimate_tokens(messages) > THRESHOLD:
            print("[auto_compact triggered]")
            messages[:] = auto_compact(messages)

        response = client.messages.create(
            model=MODEL, system=SYSTEM,
            messages=messages, tools=TOOLS, max_tokens=8000,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            return

        results = []
        manual_compact = False
        for block in response.content:
            if block.type == "tool_use":
                if block.name == "compact":
                    manual_compact = True
                    output = "Compressing..."
                else:
                    handler = TOOL_HANDLERS.get(block.name)
                    try:
                        output = handler(**block.input) if handler else f"Unknown tool: {block.name}"
                    except Exception as e:
                        output = f"Error: {e}"
                print(f"> {block.name}:")
                print(str(output)[:200])
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "content": str(output)})
        messages.append({"role": "user", "content": results})

        # ★ Layer 3: compression requested by the model
        if manual_compact:
            print("[manual compact]")
            messages[:] = auto_compact(messages)
            return

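Note the form messages[:] = auto_compact(messages): slice assignment. It is not messages = auto_compact(...) (which would only rebind the local variable); it replaces the contents of the list, so the caller's history list is updated too.

Four trigger points:

Trigger	Location	Who triggers it
micro_compact	start of every loop iteration	the system, automatically
auto_compact	token check before the LLM call	the system, automatically (over threshold)
compact tool	tool dispatch stage	the model, on its own initiative
`/compact` command	REPL main loop	the user, manually

Manual trigger by the user

Add a /compact command to the REPL main loop: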

elif query.strip() == "/compact":
    if not history:
        print("  Nothing to compress")
        continue
    print("[manual compact]")
    history[:] = auto_compact(history)
    print(f"  ✓ Current context: ~{estimate_tokens(history)} tokens\n")
    continue

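When the conversation feels slow or the context feels cluttered, the user can compress it at any time with one command. The underlying logic is exactly the same as automatic compression: back up, summarize, replace.

Sample run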

❯ Read every Python file in the harness/ directory

> read_file: harness.py
> read_file: multi-turn.py
> read_file: tool-function-calling.py
> read_file: plan-demo.py
> read_file: harness-subagent.py
> read_file: harness-parallel.py
> read_file: harness-background.py
> read_file: agentic-demo.py

  [micro_compact] replaced 5 old tool_results

I've read all 8 Python files. Here's a summary...

❯ Now analyze the architecture and write a report

  [micro_compact] replaced 2 old tool_results

> bash: wc -l harness/*.py
> write_file: ANALYSIS_REPORT.md

[auto_compact triggered]
[transcript saved: .transcripts/transcript_1712736042.jsonl]

Report written to ANALYSIS_REPORT.md.

❯ /compact

[manual compact]
[transcript saved: .transcripts/transcript_1712736100.jsonl]
  ✓ Current context: ~900 tokens

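Note the timeline:

Reading the 8 files: on every round micro_compact silently replaces old bash/write_file results, while read_file results are kept (they are in PRESERVE_RESULT_TOOLS). The [micro_compact] count lines come from a log statement that the micro_compact listing above does not include.
Writing the report triggers auto_compact: the token estimate crosses the threshold, the conversation is backed up to .transcripts/, and all messages are compressed into one summary.
The user runs /compact manually: a further cleanup that brings the context down to about 900 tokens.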

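Why three layers instead of one

One layer is not enough, because compression has a cost:

Layer	Cost	Frequency	Effect
micro_compact	almost zero (string replacement)	every round	removes the repeated transmission of old tool_results
auto_compact	one LLM call + disk I/O	occasional	compresses the whole conversation into a summary
compact tool	same as auto_compact	rare	the model feels the context is cluttered and tidies up

With auto_compact alone, every compression costs an LLM call. micro_compact uses nearly free string replacement to postpone the moment a summary becomes necessary: wherever no money needs to be spent, none is.

With micro_compact alone, a truly long conversation eventually overwhelms it: the overhead of old messages (assistant reply text, the placeholders themselves, user input) accumulates until it fills the window.

The three progressive layers work like an incremental garbage collector:

micro_compact = incremental GC, runs every round, very cheap
auto_compact = full GC, triggered occasionally, more expensive but effective
compact tool = manual GC, triggered when the model itself decides it is needed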

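Key design decisions

Why is KEEP_RECENT 3 and not 1?

At first we tried keep_recent=1, keeping only the last tool_result. In theory that is the most aggressive setting and saves the most tokens.

In practice it exposed a problem: during a chain of tool calls, the model needs to see the results of the previous few steps to decide what to do next. With only 1 kept, the model reads file A and prepares to edit it, but by the time it edits, A's contents have already been replaced with a placeholder. So the model calls read_file again, the result is replaced again, the edit needs another read, and the loop never ends.

KEEP_RECENT=3 gives the model enough "working memory": the result of the current step plus the two before it. Most tool chains have no more than 3 consecutive dependent steps.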
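A side note for anyone tuning this value: the slice `consumed_results[:-KEEP_RECENT]` behaves unexpectedly at 0, because `[:-0]` is the same as `[:0]` and selects nothing, so "keep none" would silently become "replace none". A small sketch of a slice that handles every non-negative value follows; `older_than_recent` is a hypothetical helper name.

```python
def older_than_recent(results, keep_recent):
    """Return every entry except the last keep_recent ones (keep_recent >= 0)."""
    cutoff = max(len(results) - keep_recent, 0)
    return results[:cutoff]


results = ["r1", "r2", "r3", "r4", "r5"]

print(older_than_recent(results, 3))   # the two oldest entries
print(older_than_recent(results, 0))   # all five entries
print(results[:-0])                    # the pitfall: an empty list
```

With the default of 3 the original slice is correct; the difference only matters if KEEP_RECENT is ever set to 0.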

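Why can't read_file be compressed?

read_file output is reference material: over later rounds the model may refer back to the file contents again and again to edit, compare, or analyze. If it were replaced with [Previous: used read_file], the model would be forced to call read_file again, the new output would be compressed on a later round, and that would trigger yet another read.

Adding read_file to the PRESERVE_RESULT_TOOLS set tells micro_compact: "this is reference material, don't touch it."

Why slice assignment, `messages[:] = ...`?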

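# Wrong: only rebinds the local variable; the caller's history is unchanged
messages = auto_compact(messages)

# Right: replaces the list contents; the caller's history is updated too
messages[:] = auto_compact(messages)

agent_loop(messages) receives a reference to the history list. Plain assignment, messages = ..., points the local variable at a new list while history is still the old one. Slice assignment, messages[:] = ..., modifies the list itself; history and messages refer to the same object, so the change is visible outside as well.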
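The difference is easy to demonstrate in isolation. In this sketch, `rebind` and `replace_in_place` are hypothetical stand-ins for the two ways agent_loop could apply the result of auto_compact.

```python
def rebind(messages):
    # Plain assignment: the local name now points at a new list.
    messages = [{"role": "user", "content": "summary"}]


def replace_in_place(messages):
    # Slice assignment: the existing list object gets new contents.
    messages[:] = [{"role": "user", "content": "summary"}]


history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
]

rebind(history)
len_after_rebind = len(history)        # caller still sees 3 messages

replace_in_place(history)
len_after_slice = len(history)         # caller now sees 1 message

print(len_after_rebind, len_after_slice)
```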

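Summary

Component	Before	After
Old tool_results	stay in context forever, paid for on every round	micro_compact replaces them with placeholders (except read_file)
Long conversations	error out once the window is full	auto_compact summarizes automatically
Model autonomy	cannot manage its own context	compact tool, triggered by the model
Manual control	none	`/compact` command
Conversation backup	none	JSONL transcripts saved under `.transcripts/`
Theoretical conversation length	limited by the window	unbounded

Core principle: the model only needs to see a tool result once. Replace it after it has been seen and look it up again when needed. Reference material (read_file), however, is always kept.

This is also the core idea behind the /compact command in Claude Code: it is not "forgetting", but moving information out of the active context and onto disk, where it can be retrieved when needed.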
