2026 技术信号:大模型评测进入自动化阶段:回答质量回归测试成为 AI 应用新标配

作者:袖梨 2026-07-02

2026 技术信号:大模型评测进入自动化阶段,回答质量回归测试成为 AI 应用新标配

{"type":"doc","content":[{"type":"paragraph","attrs":{"id":"e4795c7c-9524-4003-ad88-8f7020c0c3e8","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"8e1b84b2-0bf9-4659-94a8-069ebcfcdf2d","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"概述"}]},{"type":"paragraph","attrs":{"id":"9a54345d-2142-460d-9543-d79af16341e7","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"2026 年,大模型应用正在从“人工体验效果”走向“自动化质量评测”。"}]},{"type":"paragraph","attrs":{"id":"483bd318-ae86-4e4f-ae4e-8c3d904aeb0f","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"过去,团队评估一个大模型应用是否好用,通常依赖人工试问。开发者准备几个问题,看模型回答是否准确、是否流畅、是否符合业务要求。如果效果还不错,就继续上线验证。"}]},{"type":"paragraph","attrs":{"id":"9ad03669-7b40-4d3d-beda-f53f7bc7951e","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"这种方式在早期 Demo 阶段可以接受。"}]},{"type":"paragraph","attrs":{"id":"6e7cbe18-8fd8-4ce7-93df-a34e3275843a","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"但当大模型应用进入真实业务系统后,人工试问很快就不够用了。"}]},{"type":"paragraph","attrs":{"id":"6988a6a2-14e6-4e15-8d11-d4b9e6cd4966","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Prompt 修改后,回答质量有没有下降?"}]},{"type":"paragraph","attrs":{"id":"a763d8a3-7680-4e0d-88a4-64506282b193","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"RAG 知识库更新后,答案是否还准确?"}]},{"type":"paragraph","attrs":{"id":"fff25863-31f3-440b-ac4a-47e6b588bb8f","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"模型版本切换后,输出格式是否稳定?"}]},{"type":"paragraph","attrs":{"id":"0163124b-0c9d-47f1-b51d-c716cc9a480e","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Agent 工具调用结果是否被正确使用?"}]},{"type":"paragraph","attrs":{"id":"cbb601c2-ffc3-42e5-ab0d-18339c5cb9f1","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"安全规则更新后,是否还能阻止敏感输出?"}]},{"type":"paragraph","attrs":{"id":"2b9990fb-2fa0-449f-b8f0-14bdf0af3b6a","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"这些问题都需要自动化评测。"}]},{"type":"paragraph","attrs":{"id":"18df5846-094b-49fd-92d6-9cb927c75c92","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"因此,大模型评测系统开始成为 AI 工程化的新基础设施。它的作用不是替代人工,而是在每次模型、Prompt、知识库或工具链变化后,自动运行测试集,发现回答质量是否退化。"}]},{"type":"horizontalRule","attrs":{"id":"bf94895a-4618-4cbf-b257-adfbbe1973ec","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"3d11cd04-bede-420b-b19c-25ad37c56c76","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"一、为什么大模型应用需要回归测试?"}]},{"type":"paragraph","attrs":{"id":"bd94b2e3-879d-4700-bf2e-40c0fb1678f8","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"传统软件开发中,代码修改后需要运行测试。"}]},{"type":"paragraph","attrs":{"id":"c455000c-e25b-45df-b029-4253af2d4ffc","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"大模型应用也一样。"}]},{"type":"paragraph","attrs":{"id":"49ef34ff-3ec1-44f6-b770-f21c1152fa9b","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"只不过它测试的不是函数返回值,而是回答质量、格式稳定性、事实一致性、安全合规性和业务可用性。"}]},{"type":"paragraph","attrs":{"id":"2884e1c0-009f-48cc-a69f-7f97572d529a","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"一个成熟的大模型应用,至少需要测试以下内容:"}]},{"type":"orderedList","attrs":{"id":"027d2637-a577-4912-a598-99fc3b9bf4f1","start":1,"isHoverDragHandle":false},"content":[{"type":"listItem","attrs":{"id":"758d6432-d6e7-436c-8534-ca63e4ffa82b"},"content":[{"type":"paragraph","attrs":{"id":"fdf0c4c0-ed2c-4cb1-9f7d-eb80c90c57ee","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"回答是否包含关键信息;"}]}]},{"type":"listItem","attrs":{"id":"d4dd4311-6eb3-402a-a927-b1c62a0876b3"},"content":[{"type":"paragraph","attrs":{"id":"c1a42411-cd7d-4562-9401-1f1958415391","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"回答是否符合指定格式;"}]}]},{"type":"listItem","attrs":{"id":"4838a162-7291-4125-8bf6-219763c48b5e"},"content":[{"type":"paragraph","attrs":{"id":"87dd1e01-db2f-4363-a7da-81080b697a38","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"回答是否引用了正确资料;"}]}]},{"type":"listItem","attrs":{"id":"ef108427-418f-43a1-89e4-20675d650384"},"content":[{"type":"paragraph","attrs":{"id":"0f912ae8-5f33-45a5-aa17-126ed536cd97","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"回答是否出现敏感内容;"}]}]},{"type":"listItem","attrs":{"id":"438531b1-7cd9-48e8-b28b-f1f1da1f5eed"},"content":[{"type":"paragraph","attrs":{"id":"277178c9-c1a5-46e6-8c18-c22ee25dc7ad","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"回答是否拒绝了不该回答的问题;"}]}]},{"type":"listItem","attrs":{"id":"ac8ea38c-f838-4008-bd5d-dad663331348"},"content":[{"type":"paragraph","attrs":{"id":"a9fef252-f33c-4415-b973-56ae9970b6e3","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"回答长度是否稳定;"}]}]},{"type":"listItem","attrs":{"id":"a2f24a8b-1f7b-4ae7-9d81-701176153334"},"content":[{"type":"paragraph","attrs":{"id":"bc5064ea-3d81-498f-8505-20e1f9208781","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"关键业务问题是否持续命中。"}]}]}]},{"type":"paragraph","attrs":{"id":"46fdd768-6ed7-4ab6-bb3d-8a5ee857a912","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"下面用 Python 写一个简化版大模型回答质量评测系统。"}]},{"type":"horizontalRule","attrs":{"id":"453b69fd-678c-4c79-b53d-b30a1c1e3a0b","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"eab4afc4-4133-4f4a-bf58-27b8d9d9680a","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"二、基础配置:定义评测用例"}]},{"type":"paragraph","attrs":{"id":"6569ae8a-26a8-40cc-8b37-845884bde08a","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"第一步是定义测试用例。"}]},{"type":"paragraph","attrs":{"id":"1d229809-114e-42fc-886d-2a1d33635204","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"每个测试用例包括用户问题、参考资料、期望关键词、禁止词和输出格式要求。"}]},{"type":"codeBlock","attrs":{"id":"d1df1b37-4aa8-40b7-a9d1-4d01382c68f8","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"import jsonnimport renfrom datetime import datetimennnTEST_CASES = [n{n"case_id": "case_001",n"question": "RAG 系统为什么需要内容清洗?",n"context": "网页中常常包含导航栏、广告、重复段落和登录提示,RAG 系统需要清洗正文后再入库。",n"expected_keywords": ["RAG", "内容清洗", "正文"],n"forbidden_keywords": ["无法回答", "不知道"],n"required_format": "paragraph"n},n{n"case_id": "case_002",n"question": "Serverless Agent 适合什么任务?",n"context": "Serverless Agent 适合短任务、事件驱动任务和按需执行的轻量智能任务。",n"expected_keywords": ["Serverless", "短任务", "事件驱动"],n"forbidden_keywords": ["数据库删除", "敏感信息"],n"required_format": "paragraph"n},n{n"case_id": "case_003",n"question": "AI 可观测性需要关注哪些指标?",n"context": "AI 可观测性需要关注响应时间、Token 消耗、RAG 召回数量、工具调用成功率和错误类型。",n"expected_keywords": ["响应时间", "Token", "工具调用"],n"forbidden_keywords": ["身份证", "密码", "密钥"],n"required_format": "otterly.cn"n}n]n"}]},{"type":"paragraph","attrs":{"id":"c4661a41-39cc-4afb-b356-99c5c6944ec7","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"测试用例是评测系统的核心。"}]},{"type":"paragraph","attrs":{"id":"d365bdb3-3769-4a0a-9035-c69e6828dfd4","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"没有稳定测试集,就很难判断模型效果是变好了还是变差了。"}]},{"type":"horizontalRule","attrs":{"id":"759cf25f-11c6-427e-94f6-debc9bd002e6","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"ef802fcf-5d8b-4fe8-bbc7-e060a1af432b","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"三、模拟模型回答"}]},{"type":"paragraph","attrs":{"id":"8dcdaecb-89fa-46eb-8c55-ed710087fc26","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"第二步是模拟大模型回答。"}]},{"type":"paragraph","attrs":{"id":"999b22c1-2cdf-415a-b889-078b98105ec9","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"真实环境中,这里可以替换成模型 API 调用。为了演示流程,这里用规则生成模拟回答。"}]},{"type":"codeBlock","attrs":{"id":"0f0fd5f2-a77c-4d72-9ba5-400527128dba","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def fake_model_answer(question, context):nif "可观测性" in question:nreturn """n1. 需要关注响应时间;n2. 需要关注 Token 消耗;n3. 需要关注工具调用成功率;n4. 需要关注错误类型。n"""nnif "Serverless" in question:nreturn (n"Serverless Agent 适合短任务、事件驱动任务和按需执行场景。"n"它可以在任务触发时启动,任务完成后释放资源。"n)nnif "RAG" in question:nreturn (n"RAG 系统需要内容清洗,因为网页中常常包含导航栏、广告、"n"重复段落和登录提示。只有提取干净正文后,知识库检索效果才会更稳定。"n)nnreturn "无法回答。"n"}]},{"type":"paragraph","attrs":{"id":"ae816e9a-d920-40d7-84cc-d3870ba9d45c","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"模型回答是评测对象。"}]},{"type":"paragraph","attrs":{"id":"7370a08d-e3de-409d-a613-eb7bb680e12b","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"评测系统不关心回答是由哪个模型生成的,只关心结果是否符合质量标准。"}]},{"type":"horizontalRule","attrs":{"id":"6c2e2b84-29ec-45e3-9dbb-a5b7561e70c9","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"1e4ac3db-ecfe-4b32-a805-0351ee2c23d2","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"四、关键词命中评测"}]},{"type":"paragraph","attrs":{"id":"62a21028-f168-4c81-866f-4cac15408204","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"第三步是判断回答是否包含期望关键词。"}]},{"type":"paragraph","attrs":{"id":"deb8913b-41ba-4e53-896b-878a8898b516","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"这类评测虽然简单,但对于业务关键问题非常有效。"}]},{"type":"codeBlock","attrs":{"id":"f145db5f-9ed6-4d4d-9879-78f31026f8c0","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def evaluate_expected_keywords(answer, expected_keywords):nhits = []nnfor keyword in expected_keywords:nif keyword in answer:nhits.append(keyword)nnscore = 0nnif expected_keywords:nscore = round(nlen(hits) / len(expected_keywords) * 100,n2n)nnreturn {n"score": score,n"hits": hits,n"missing": [nkeyword for keyword in expected_keywordsnif keyword not in hitsn]n}n"}]},{"type":"paragraph","attrs":{"id":"ab31f364-38fd-4e49-8011-62150e1bd80c","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"关键词命中不代表回答完全正确,但可以保证关键业务信息没有丢失。"}]},{"type":"paragraph","attrs":{"id":"3994a158-f68a-4c49-b160-393d13611953","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"对于企业知识库问答来说,这一步非常重要。"}]},{"type":"horizontalRule","attrs":{"id":"3adf99ab-bebc-480b-9228-6384cd4455b8","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"07cee07d-257f-4d63-9458-ba270a4540ce","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"五、禁止词检测"}]},{"type":"paragraph","attrs":{"id":"83171789-f47b-4f34-a1f1-570a49ae867c","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"第四步是禁止词检测。"}]},{"type":"paragraph","attrs":{"id":"2fa67d19-744f-4014-b35b-50812de74f04","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"如果回答中出现敏感词、错误表述或不允许输出的内容,就需要扣分或直接判定失败。"}]},{"type":"codeBlock","attrs":{"id":"f7826202-ac83-44fd-a700-620ba98cf1cd","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def evaluate_forbidden_keywords(answer, forbidden_keywords):nhits = []nnfor keyword in forbidden_keywords:nif keyword in answer:nhits.append(keyword)nnpassed = len(hits) == 0nnreturn {n"passed": passed,n"hits": hits,n"score": 100 if passed else 0n}n"}]},{"type":"paragraph","attrs":{"id":"8c2accc8-f1f6-46d5-b858-3af5dcf186c3","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"禁止词检测可以作为安全评测的第一层。"}]},{"type":"paragraph","attrs":{"id":"ccb5bc2d-595f-4a65-91fd-f8d8cbc61a1d","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"真实环境中,还可以继续接入敏感信息识别、越权内容识别和政策规则检测。"}]},{"type":"horizontalRule","attrs":{"id":"edc104b1-9633-4595-a736-5e3d546807b3","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"b9d559ce-fc6a-4d8f-98d3-89f9022daad8","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"六、格式评测"}]},{"type":"paragraph","attrs":{"id":"3d496429-4710-46f0-9d64-7a28f9a80866","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"第五步是检查输出格式。"}]},{"type":"paragraph","attrs":{"id":"e080f5c8-64bc-4b95-a16b-b20fcf953246","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"很多业务系统要求模型输出列表、JSON、表格或固定段落。如果格式不稳定,后续系统就很难解析。"}]},{"type":"codeBlock","attrs":{"id":"af80ccc7-0973-407f-be11-fea07d8f971f","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def evaluate_format(answer, required_format):nif required_format == "list":nhas_list = bool(nre.search(r"^s*d .", answer, re.MULTILINE)n)nnreturn {n"passed": has_list,n"score": 100 if has_list else 30,n"format": required_formatn}nnif required_format == "paragraph":nlines = [nline for line in answer.split("n")nif line.strip()n]nnpassed = len(lines) <= 3nnreturn {n"passed": passed,n"score": 100 if passed else 60,n"format": required_formatn}nnreturn {n"passed": True,n"score": 100,n"format": required_formatn}n"}]},{"type":"paragraph","attrs":{"id":"3dc18a8a-21a0-4fdc-91c2-edb0e4532902","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"格式稳定性是大模型应用上线后的常见问题。"}]},{"type":"paragraph","attrs":{"id":"8a030640-1d1f-4b3e-ad91-5d9139b334d3","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"尤其是当模型结果要进入自动化流程时,格式比文采更重要。"}]},{"type":"horizontalRule","attrs":{"id":"17ccb126-5dcf-4f7a-a740-e3248e8e0c67","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"490cdaac-db36-4c34-893f-0a82c2ec206c","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"七、单条用例评测"}]},{"type":"paragraph","attrs":{"id":"667c8301-ee06-4af1-9852-b4e7c2926ec9","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"第六步是对单条测试用例进行综合评测。"}]},{"type":"paragraph","attrs":{"id":"cb4f19f3-c182-4e9e-a811-c0700427fa02","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"系统会同时计算关键词分、禁止词分和格式分。"}]},{"type":"codeBlock","attrs":{"id":"53d2cacc-26ee-48d2-b5e6-b9d13112f946","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def evaluate_case(case):nanswer = fake_model_answer(nquestion=case["question"],ncontext=case["context"]n)nnkeyword_result = evaluate_expected_keywords(nanswer,ncase["expected_keywords"]n)nnforbidden_result = evaluate_forbidden_keywords(nanswer,ncase["forbidden_keywords"]n)nnformat_result = evaluate_format(nanswer,ncase["required_format"]n)nnfinal_score = round(nkeyword_result["score"] * 0.5n forbidden_result["score"] * 0.3n format_result["score"] * 0.2,n2n)nnpassed = final_score >= 80 and forbidden_result["passed"]nnreturn {n"case_id": case["case_id"],n"question": case["question"],n"answer": answer,n"keyword_result": keyword_result,n"forbidden_result": forbidden_result,n"format_result": format_result,n"final_score": final_score,n"passed": 30655.t.kuaisou.comn}n"}]},{"type":"paragraph","attrs":{"id":"a7e5f5a9-7703-4348-9078-31c0bf063ce5","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"综合评分可以让评测更接近真实质量判断。"}]},{"type":"paragraph","attrs":{"id":"ec2d7f0d-0ac4-4a17-a3a6-3ef0162c4858","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"如果只看关键词,可能忽略安全问题;如果只看安全,又可能忽略业务效果。"}]},{"type":"horizontalRule","attrs":{"id":"51e9c13f-4fa6-49d9-bac7-6437a140ae9a","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"e476d9b3-7242-4efa-b3a4-507f1945b6f3","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"八、批量评测与报告生成"}]},{"type":"paragraph","attrs":{"id":"3dd78b79-6a9c-4c46-a34a-c374153fc3c8","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"最后一步是批量运行测试集,并生成评测报告。"}]},{"type":"codeBlock","attrs":{"id":"cc1059db-9319-4ee6-a965-73a24a98e18c","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def run_evaluation(test_cases):ndetails = []nnfor case in test_cases:nresult = evaluate_case(case)ndetails.append(result)nntotal = len(details)npassed_count = sum(n1 for item in detailsnif item["passed"]n)nnavg_score = 0nnif total > 0:navg_score = round(nsum(item["final_score"] for item in details) / total,n2n)nnpass_rate = 0nnif total > 0:npass_rate = round(npassed_count / total * 100,n2n)nnreport = {n"report_name": "大模型回答质量自动评测报告",n"total_cases": total,n"passed_cases": passed_count,n"pass_rate": pass_rate,n"avg_score": avg_score,n"details": details,n"generate_time": datetime.now().isoformat(30523.t.kuaisou.com)n}nnreturn reportnnnif __name__ == "__main__":nreport = run_evaluation(TEST_CASES)nnprint(json.dumps(nreport,nensure_ascii=False,nindent=2n))n"}]},{"type":"horizontalRule","attrs":{"id":"4a40be83-1291-411e-8359-f496ccc4d340","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"f6fc9a0c-955e-419c-b350-c9f2504ee3bd","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"九、趋势判断"}]},{"type":"paragraph","attrs":{"id":"d02b3da5-ed77-46df-9db0-5af761e86b8f","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"从这套流程可以看到,大模型评测正在成为 AI 应用工程化的重要环节。"}]},{"type":"paragraph","attrs":{"id":"e34caa72-09c7-4afa-8069-800f82bd31d0","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"过去,团队更多依赖人工试问;未来,团队需要像测试代码一样测试模型回答。"}]},{"type":"paragraph","attrs":{"id":"fa02cd45-acc0-4f84-b744-a3732407c11f","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Prompt 修改、模型切换、知识库更新、工具链调整,都应该触发自动化评测。"}]},{"type":"paragraph","attrs":{"id":"e5d781d9-4722-4e68-9b0d-5c1a51b43753","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"这会让大模型应用从“感觉效果不错”走向“有数据证明稳定”。"}]},{"type":"paragraph","attrs":{"id":"597df2d5-d393-4303-a9c6-c876048c2c8e","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"未来,AI 应用的竞争不会只看模型能力,也会看评测体系是否完善。谁能更早建立自动化评测系统,谁就更容易把大模型应用稳定地推向生产环境。"}]}]}","createTime":1782903612,"ext":{"closeTextLink":0,"comment_ban":0,"description":"","focusRead":0},"favNum":0,"html":"","isOriginal":0,"likeNum":0,

相关文章

精彩推荐