Agent Series (15): Advanced Agent Memory Systems. Short-Term, Long-Term, Compression: A Three-Layer Memory Architecture

Author: 袖梨, 2026-07-04

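A memory system is more than "storing the chat log"

Stuffing the conversation history into the prompt is the crudest form of memory. A real system needs considerably more:

The user mentions their city in turn 3; when they ask about the weather in turn 10, the Agent should know where to look
The user told the system last week which product plan they are on; a new session should not ask again
The conversation has run for 20 turns and the context window is nearly full; how do you compress it without losing key information?

These are three different problems, and they need three different mechanisms.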

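Three-layer memory architecture

Layer	Scope	Mechanism	Typical use
Short-term memory	Within a session	MemorySaver checkpointer	Multi-turn Q&A
Long-term memory	Across sessions	Persistent KV store / vector store	Personalization
History compression	Within a session (protective)	Replace history with a summary	Token guard for long conversations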

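Demo 1: Short-term memory with MemorySaver

LangGraph's MemorySaver is the most lightweight short-term memory implementation: it binds the conversation history to a thread_id and attaches that history automatically on the next call with the same thread_id.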

from langchain_core.messages import HumanMessage
from langchain_core.runnables import RunnableConfig
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

# llm, get_weather, calculator and get_product_info are defined outside this snippet.
checkpointer = MemorySaver()
stateful_agent = create_react_agent(
    model=llm,
    tools=[get_weather, calculator, get_product_info],
    checkpointer=checkpointer,
)

# Every call that uses this config shares one conversation history.
THREAD_A: RunnableConfig = {"configurable": {"thread_id": "thread-alice"}}

# Turn 1: share name and city
r1 = stateful_agent.invoke(
    {"messages": [HumanMessage("Hi, I'm Alice. I live in Beijing.")]},
    config=THREAD_A,
)

# Turn 2: same thread_id, so the history is attached automatically
r2 = stateful_agent.invoke(
    {"messages": [HumanMessage("What's the weather like where I live today?")]},
    config=THREAD_A,
)

Actual run output:

Thread A Turn 1: Hello Alice! How can I assist you today?
Thread A Turn 2: Sure, I can help you with that. I will need to know the
                 city you are in. Could you please provide me with the
                 name of your city?
                 Tools used: []
Thread B (no context): I can help with that. Could you please provide
                       your city name?
                       Tools used: []

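Thread A and Thread B gave essentially the same answer: both asked for the city and neither called a tool.

This is an important finding. The MemorySaver infrastructure works: Thread A's second call really did carry the full history (both user messages, plus the Turn 1 reply), while Thread B had only its single message. But GLM-4-Flash did not connect "I live in Beijing" (Turn 1) with "where I live" (Turn 2). That is a model-capability problem, not a MemorySaver problem.

With the same prompt, a stronger model such as GPT-4 or Claude would likely go straight to looking up the weather in Beijing; a weaker model may need more explicit wording ("What's the weather in Beijing?") before it triggers the tool call.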

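Short-term memory therefore has two layers:

Infrastructure layer: MemorySaver makes sure the message history is passed along
Model layer: whether the LLM can extract and use context from that history, which depends on model capability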
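
To make the infrastructure layer concrete, here is a minimal offline sketch of what a thread-keyed checkpointer does. ToyCheckpointer and run_turn are hypothetical names invented for illustration; this is not how MemorySaver is implemented, only the same idea in miniature: history is stored per thread_id and replayed on the next call.

```python
from collections import defaultdict


class ToyCheckpointer:
    """Hypothetical stand-in for a checkpointer: one message list per thread_id."""

    def __init__(self) -> None:
        self._threads: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def load(self, thread_id: str) -> list[tuple[str, str]]:
        # Return a copy so callers cannot mutate the stored history by accident.
        return list(self._threads[thread_id])

    def append(self, thread_id: str, role: str, content: str) -> None:
        self._threads[thread_id].append((role, content))


def run_turn(cp: ToyCheckpointer, thread_id: str, user_text: str) -> list[tuple[str, str]]:
    """Record the user turn and return everything the model would be shown."""
    cp.append(thread_id, "user", user_text)
    model_input = cp.load(thread_id)
    # A real agent would call the LLM here; a canned reply keeps the sketch offline.
    cp.append(thread_id, "assistant", f"(reply to: {user_text})")
    return model_input


cp = ToyCheckpointer()
run_turn(cp, "thread-alice", "Hi, I'm Alice. I live in Beijing.")
seen_a = run_turn(cp, "thread-alice", "What's the weather like where I live today?")
seen_b = run_turn(cp, "thread-bob", "What's the weather like where I live today?")

print(len(seen_a))  # 3: turn-1 user message, turn-1 reply, turn-2 user message
print(len(seen_b))  # 1: a fresh thread starts with no history
```

The point of the sketch is that delivering history and using history are separate concerns: the checkpointer can only guarantee the first one.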

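Demo 2: Long-term memory as a cross-session fact store

The core idea of cross-session memory: use the LLM to extract key facts from the conversation, write them to persistent storage, and inject them into the system prompt of the next session.

Session 1: extract and store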

from langchain_core.messages import HumanMessage, SystemMessage

# Simulated persistent store (in production, replace with a database or vector store)
LONG_TERM_STORE: dict[str, dict[str, str]] = {}


def extract_facts(conversation: str) -> dict[str, str]:
    resp = llm.invoke([
        SystemMessage(
            "Extract key facts about the user. "
            'Return ONLY JSON: {"city": "...", "plan": "..."}'
        ),
        HumanMessage(f"Conversation:\n{conversation}"),
    ])
    # Parse the JSON response
    ...
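
The parsing step is elided above. As one possible way to handle it, the sketch below pulls the first JSON object out of a model reply and degrades to an empty dict on failure. parse_facts is a hypothetical helper, not the author's code, and lowercasing the values is an assumption.

```python
import json
import re


def parse_facts(raw: str) -> dict[str, str]:
    """Best-effort parse of a model reply that should contain one JSON object."""
    # Take the outermost {...} span so any chatter around the JSON is ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    # Keep only non-empty scalar values, normalized to lowercase strings (assumption).
    facts: dict[str, str] = {}
    for key, value in data.items():
        if isinstance(value, (str, int, float)) and not isinstance(value, bool):
            text = str(value).strip().lower()
            if text:
                facts[str(key)] = text
    return facts


reply = 'Sure! {"city": "Shanghai", "plan": "WonderBot Pro"} Hope that helps.'
print(parse_facts(reply))           # {'city': 'shanghai', 'plan': 'wonderbot pro'}
print(parse_facts("no json here"))  # {}
```

Returning an empty dict instead of raising means a malformed reply simply stores nothing, which is usually safer for a background extraction step than failing the whole turn.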

Session 1 conversation:

User: I'm Alice. I'm based in Shanghai and my team uses WonderBot Pro.
User: We mainly use the API for data processing — about 50,000 calls a month.

Extracted facts, as written to the store:

{'name': 'alice', 'city': 'shanghai', 'team': 'wonderbot pro', 'api_calls': '50000'}
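
The demo keeps facts in an in-process dict, which is lost on restart. A minimal sketch of a store that can outlive a session, assuming SQLite from the standard library: load_user_facts matches the name used in Session 2, while save_user_facts, the table layout, and the column names are hypothetical.

```python
import sqlite3

# Assumption: ":memory:" keeps the sketch side-effect free; use a file path to persist.
_conn = sqlite3.connect(":memory:")
_conn.execute(
    "CREATE TABLE IF NOT EXISTS user_facts ("
    "user_id TEXT NOT NULL, fact_key TEXT NOT NULL, fact_value TEXT NOT NULL, "
    "PRIMARY KEY (user_id, fact_key))"
)


def save_user_facts(user_id: str, facts: dict[str, str]) -> None:
    """Write facts; a newer value replaces an older one stored under the same key."""
    with _conn:  # commits on success, rolls back on error
        _conn.executemany(
            "INSERT OR REPLACE INTO user_facts (user_id, fact_key, fact_value) "
            "VALUES (?, ?, ?)",
            [(user_id, key, value) for key, value in facts.items()],
        )


def load_user_facts(user_id: str) -> dict[str, str]:
    """Return every stored fact for one user, or an empty dict for an unknown user."""
    rows = _conn.execute(
        "SELECT fact_key, fact_value FROM user_facts WHERE user_id = ? ORDER BY fact_key",
        (user_id,),
    ).fetchall()
    return dict(rows)


save_user_facts(
    "user-alice",
    {"name": "alice", "city": "shanghai", "team": "wonderbot pro", "api_calls": "50000"},
)
print(load_user_facts("user-alice")["city"])  # shanghai
print(load_user_facts("user-unknown"))        # {}
```

Keying rows on (user_id, fact_key) means a later session that extracts a new city overwrites the old one rather than accumulating conflicting facts.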

Session 2: inject and use

stored = load_user_facts("user-alice")
facts_text = "; ".join(f"{k}={v}" for k, v in stored.items())

personalized_prompt = (
    "You are a helpful assistant. "
    f"Known facts about this user: {facts_text}. "
    "Use these facts to personalize your responses without asking the user to repeat themselves."
)

personalized_agent = create_react_agent(model=llm, tools=TOOLS, prompt=personalized_prompt)

Session 2 output:

User: What's the weather like in my city today?
Agent: The current weather in Shanghai is 22 degrees Celsius with cloudy conditions.
Tools used: ['get_weather']

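The Agent queried Shanghai directly and never asked "where do you live?". That is because city=shanghai was already in the system prompt: the model did not have to infer it from the conversation history, it simply read a fact that was stated explicitly.

This is also why long-term memory proved more reliable than short-term memory in these tests: facts are injected in an explicit key-value format, so nothing depends on the model inferring them from history.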

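Demo 3: History compression

As a conversation gets longer, the tokens sent on each call grow roughly linearly with the history, and response latency grows with them. The compression strategy: set a token threshold and, once it is exceeded, replace the history with a summary.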

COMPRESSION_THRESHOLD = 250   # tokens


def summarize_messages(messages: list) -> str:
    # Skip tool-call messages and truncate each message to 150 characters.
    history_text = "\n".join(
        f"{'User' if isinstance(m, HumanMessage) else 'Agent'}: {str(m.content)[:150]}"
        for m in messages
        if isinstance(m, (HumanMessage, AIMessage)) and not getattr(m, "tool_calls", None)
    )
    resp = llm.invoke([
        SystemMessage(
            "Summarize this conversation in 2-3 sentences. "
            "Preserve all key facts: names, cities, numbers, product names."
        ),
        HumanMessage(f"Conversation:\n{history_text}"),
    ])
    return str(resp.content)


# After each turn, check the token count (total_tokens is tracked outside this snippet)
if total_tokens > COMPRESSION_THRESHOLD:
    summary = summarize_messages(messages)
    messages = [SystemMessage(f"Conversation summary so far: {summary}")]
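
The snippet above does not show where total_tokens comes from. A minimal offline sketch of the whole guard, under stated assumptions: messages are plain (role, content) tuples rather than LangChain objects, estimate_tokens uses a rough 4-characters-per-token heuristic instead of a real tokenizer, and the summarizer is passed in so the logic can be exercised without an LLM. All names here are hypothetical.

```python
from typing import Callable

Message = tuple[str, str]  # (role, content); a stand-in for LangChain message objects


def estimate_tokens(messages: list[Message]) -> int:
    """Rough heuristic of ~4 characters per token. An assumption, not a tokenizer."""
    return sum(len(content) for _, content in messages) // 4


def maybe_compress(
    messages: list[Message],
    threshold: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Return the history unchanged at or below the threshold, else a one-message summary."""
    if estimate_tokens(messages) <= threshold:
        return messages
    summary = summarize(messages)
    return [("system", f"Conversation summary so far: {summary}")]


def fake_summarize(msgs: list[Message]) -> str:
    # Placeholder for the LLM call made by summarize_messages.
    return f"{len(msgs)} earlier messages condensed."


short_history = [("user", "I'm Bob from Shenzhen.")]
long_history = [("user", "x" * 600), ("assistant", "y" * 600)]

print(maybe_compress(short_history, 250, fake_summarize) is short_history)  # True
print(len(maybe_compress(long_history, 250, fake_summarize)))               # 1
```

If the model API reports token usage on each response, that figure is a better input than a character-count estimate.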

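Actual result of Demo 3: over 5 turns (Bob is in Shenzhen, evaluating WonderBot Pro for 8 developers, annual cost 299 × 12 = 3,588), the total stayed at 198 tokens, below the 250-token threshold, so compression never triggered.

Final check: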

User: Quickly summarize: who am I, what city, and what's the annual API cost?
Agent: You are Bob, from Shenzhen, and the annual API cost for WonderBot Pro 
       for 8 developers is $3,588.

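All 10 messages of history were kept intact, and the Agent remembered every key fact. Compression is a safety valve that does not fire on every turn: when the conversation is short enough, the raw history is more precise than a summary. Suggested threshold: around 4k tokens (a rule-of-thumb value close to where model inference efficiency starts to degrade).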

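Measured comparison of the three modes

Mode	Demo 1 result	Demo 2 result	Key difference
Short-term memory	MemorySaver stored the history correctly, but GLM-4-Flash failed to use the implicit context	-	Depends on the model's inference ability
Long-term memory	-	Called get_weather(Shanghai) directly, with zero follow-up questions	Explicit KV injection, no inference needed
Compression	-	-	Safety-valve mechanism, triggered on demand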