
{"type":"doc","content":[{"type":"paragraph","attrs":{"id":"59ffe9b2-dd89-4b1f-981b-e461501fd26c","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"19c0737f-12f2-4894-bf92-c4598413524b","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"Overview"}]},{"type":"paragraph","attrs":{"id":"d0ba499e-0f48-4aa7-9eec-540ae0a690a5","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"In 2026, enterprise knowledge bases are moving from plain text toward multimodal document parsing."}]},{"type":"paragraph","attrs":{"id":"f57b1b0b-b817-458c-afdc-b2b2b92c36c3","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Until now, RAG systems have mostly handled web pages, Markdown, Word documents, and plain text. Developers split documents into chunks, write them to a vector store, and have a large language model answer questions from the retrieved content."}]},{"type":"paragraph","attrs":{"id":"0878d9ff-8f48-415d-b815-28ac10f0302c","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"In real enterprise settings, however, much of the knowledge is not plain text."}]},{"type":"paragraph","attrs":{"id":"b3172076-ad73-4a02-a54f-22e732c2cd48","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Contracts may be PDFs, reports may be Excel workbooks, proposals may be PPT decks, equipment manuals may contain images, financial materials may contain tables, and technical documents may contain flowcharts and screenshots."}]},{"type":"paragraph","attrs":{"id":"c1b0c06b-a9c1-4df7-936f-e26e2835a946","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"If an AI 
knowledge base can only process text, a great deal of key information is lost."}]},{"type":"paragraph","attrs":{"id":"acaa314c-ed7f-4c1c-8fce-80f42bcc85ad","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"This is why multimodal document parsing is becoming important."}]},{"type":"paragraph","attrs":{"id":"8abbc4ca-470a-45ef-8122-9d630e399815","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Its goal is to convert PDFs, images, tables, headings, paragraphs, and metadata into one structured representation before they enter the knowledge base and LLM applications."}]},{"type":"paragraph","attrs":{"id":"ea00ab70-49f8-4e35-8f9e-580952f47cca","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"In other words, the data entry point for enterprise AI is widening."}]},{"type":"horizontalRule","attrs":{"id":"013bf2e4-b7e1-4b63-9fff-434d011e63d4","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"a73b5ad3-42f0-42b8-bd3c-1f140fd1a9fa","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"1. Why Multimodal Parsing Matters"}]},{"type":"paragraph","attrs":{"id":"4dc5a3d9-1386-4c0f-bf55-f43f9a49b9e9","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Enterprise materials are rarely simple text."}]},{"type":"paragraph","attrs":{"id":"dc22209b-d120-4c45-96d8-f07d1af6f87e","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"For example:"}]},{"type":"orderedList","attrs":{"id":"701a40f0-d6d1-4823-903f-e7d876d2f233","start":1,"isHoverDragHandle":false},"content":[{"type":"listItem","attrs":{"id":"8f64f5a6-ed38-49c5-81be-5f81824536b2"},"content":[{"type":"paragraph","attrs":{"id":"3a7a0d30-4285-4c7c-8219-cad7500cae2b","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"PDF 
files contain section headings, body text, and tables;"}]}]},{"type":"listItem","attrs":{"id":"b1b87c1c-f46b-41af-bc63-7386f49edf52"},"content":[{"type":"paragraph","attrs":{"id":"0b55c96e-d41e-412d-a34d-801001af4523","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"PPT decks contain slide layouts, diagrams, and captions;"}]}]},{"type":"listItem","attrs":{"id":"5c94a3dd-9371-49f7-959a-c100b30bf86f"},"content":[{"type":"paragraph","attrs":{"id":"bdafeb70-ee02-4d2f-a827-fb21d827da0b","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Images may include screenshots, flowcharts, and embedded text;"}]}]},{"type":"listItem","attrs":{"id":"2313eb18-7045-492a-87f3-2f040d6bdee9"},"content":[{"type":"paragraph","attrs":{"id":"72562487-17ea-4142-acba-5a6474e5237d","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Excel workbooks contain metrics, amounts, dates, and categories;"}]}]},{"type":"listItem","attrs":{"id":"eb7d7e92-9da3-4644-aaa9-161752269067"},"content":[{"type":"paragraph","attrs":{"id":"be0376ef-3ba2-4264-90cb-cf1d8136c0a1","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Scanned files may contain contracts, invoices, and seals."}]}]}]},{"type":"paragraph","attrs":{"id":"d1671101-65f1-4755-a50f-bdd074256b0d","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Content that cannot be parsed cannot enter the AI system."}]},{"type":"paragraph","attrs":{"id":"676a0fb7-9328-44e9-a11c-608182d65a58","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"An AI knowledge base therefore needs to move from “text parsing” to “document parsing.”"}]},{"type":"paragraph","attrs":{"id":"9d8c5ad5-8f73-4be9-a29e-803c1415d4ae","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The following Python 
code sketches a simplified multimodal document parsing pipeline, focusing on structure design and the processing chain."}]},{"type":"horizontalRule","attrs":{"id":"43f32a98-2efe-4a0d-8fc8-75056fe4e941","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"ef0ec0b3-0993-4543-952a-d75ffa4bd66d","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"2. Basic Structure: Defining the Document Block"}]},{"type":"paragraph","attrs":{"id":"e7233603-128c-4cb9-8d4c-5bca3c9d58fb","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The first step is to define the document block structure."}]},{"type":"paragraph","attrs":{"id":"061b33b7-bdf9-4cd0-bee3-c30cf8f536ce","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Whether the source is a PDF, an image, or a table, the content ends up as the same kind of document block."}]},{"type":"codeBlock","attrs":{"id":"f810e595-0ce4-4d5e-916a-2979475b62a6","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"import json\nimport hashlib\nfrom datetime import datetime\nfrom typing import Dict, List, Optional\n\n\nclass DocumentBlock:\n    # One unified unit of parsed content, whatever the source format.\n    def __init__(\n        self,\n        block_type: str,\n        content: str,\n        page: Optional[int] = None,\n        position: Optional[Dict] = None,\n        metadata: Optional[Dict] = None\n    ):\n        self.block_type = block_type\n        self.content = content\n        self.page = page\n        self.position = position or {}\n        self.metadata = metadata or {}\n        self.block_id = self.build_id()\n        self.created_at = datetime.now().isoformat()\n\n    def build_id(self):\n        # ID is the MD5 of type, page, and the first 50 characters of content.\n        raw = f\"{self.block_type}-{self.page}-{self.content[:50]}\"\n\n        return hashlib.md5(\n            raw.encode(\"utf-8\")\n        ).hexdigest()\n\n    def to_dict(self):\n        return {\n            \"block_id\": self.block_id,\n            \"block_type\": self.block_type,\n            \"content\": self.content,\n            \"page\": self.page,\n            \"position\": self.position,\n            \"metadata\": self.metadata,\n            \"created_at\": 
self.created_at\n        }\n"}]},{"type":"paragraph","attrs":{"id":"4d87d126-b144-486b-9867-f9f0e5065369","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The unified document block is the core of multimodal parsing."}]},{"type":"paragraph","attrs":{"id":"1a7bd398-45c0-49a9-8d7c-02bca3e3b936","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Once the data shares one structure, later chunking, retrieval, summarization, and question answering become easier to handle."}]},{"type":"horizontalRule","attrs":{"id":"83b86884-4f9b-44e2-9431-e2560caeaeb1","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"175023b7-9379-484c-aa5f-18f16d748123","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"3. Simulated PDF Parsing: Extracting Page Text"}]},{"type":"paragraph","attrs":{"id":"452d0d8e-c7ad-480e-bc1e-a06438a9fc80","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The second step simulates parsing PDF pages."}]},{"type":"paragraph","attrs":{"id":"f9af6afb-f3dd-48ae-8e57-bf9b6e14be3f","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"A list of strings stands in for the text of each PDF page. In a real system, this can be replaced with a PDF parsing library or an OCR service."}]},{"type":"codeBlock","attrs":{"id":"c49e9753-3eba-49cd-84ac-ccbdc3e07af3","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def parse_pdf_pages(pdf_pages: List[str]):\n    blocks = []\n\n    for page_index, page_text in enumerate(pdf_pages, start=1):\n        lines = page_text.split(\"\\n\")\n\n        for line_index, line in enumerate(lines):\n            line = line.strip()\n\n            if not line:\n                continue\n\n            block_type = \"paragraph\"\n\n            # Heuristic: a short line that does not end with the Chinese\n            # full stop is treated as a title.\n            if len(line) < 30 and not line.endswith(\"。\"):\n                block_type = \"title\"\n\n            block = DocumentBlock(\n                block_type=block_type,\n                content=line,\n                page=page_index,\n                position={\n                    \"line\": line_index\n                },\n                metadata={\n                    \"source_type\": \"pdf\"\n                }\n            )\n\n            blocks.append(block.to_dict())\n\n    return 
blocks\n"}]},{"type":"paragraph","attrs":{"id":"1aa0398f-ef8c-4723-9382-f9cf2db4e9ad","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The hard part of PDF parsing is preserving structure."}]},{"type":"paragraph","attrs":{"id":"3019488d-702b-4000-a57e-e484ce413a78","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Headings, body text, page numbers, and positional relationships can all affect downstream answer quality."}]},{"type":"horizontalRule","attrs":{"id":"7e0f8099-52ba-41ac-b23f-c1ae58606c53","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"9e890f3f-ed54-442b-b194-2e480972599f","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"4. Simulated Table Parsing: Converting to Readable Text"}]},{"type":"paragraph","attrs":{"id":"8acb4417-61ad-44a0-9c38-87a8525787de","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The third step simulates table parsing."}]},{"type":"paragraph","attrs":{"id":"0bad5d08-038a-40a4-afc4-c78c594c7bc7","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"A table cannot simply be treated as ordinary text, or column names and row relationships are easily lost."}]},{"type":"codeBlock","attrs":{"id":"8cc2342a-dcc9-4b75-9477-afc998ddb61a","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def parse_table(table_name: str, headers: List[str], rows: List[List[str]]):\n    blocks = []\n\n    for row_index, row in enumerate(rows):\n        values = []\n\n        # Pair each cell with its column header.\n        for header, value in zip(headers, row):\n            values.append(f\"{header}: {value}\")\n\n        # Row description: table name, 1-based row number, then the pairs.\n        content = f\"{table_name},第 {row_index + 1} 行,\" + \";\".join(values)\n\n        block = DocumentBlock(\n            block_type=\"table_row\",\n            content=content,\n            page=None,\n            position={\n                \"row\": row_index + 1\n            },\n            metadata={\n                \"source_type\": \"table\",\n                \"table_name\": table_name,\n                \"headers\": headers\n            }\n        )\n\n        blocks.append(block.to_dict())\n\n    return 
blocks\n"}]},{"type":"paragraph","attrs":{"id":"94f26421-97da-47cb-8589-5f50bf96be64","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The key to table parsing is turning structured data into a natural-language description that a large language model can understand."}]},{"type":"paragraph","attrs":{"id":"44b65bfc-a377-454a-b16c-9fbd6c9bc188","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"This preserves the relationships between fields and also makes later retrieval easier."}]},{"type":"horizontalRule","attrs":{"id":"00cb7b30-8f7e-439e-a6d0-77cd86f4bbc5","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"7e03c8f7-d6dc-4c58-8167-ec91db25fec5","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"5. Simulated Image OCR: Extracting Text from Images"}]},{"type":"paragraph","attrs":{"id":"2234e77c-2d6a-434d-b0bf-7c26c0d887df","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The fourth step simulates image OCR."}]},{"type":"paragraph","attrs":{"id":"7419d138-c487-4df1-91a7-5175e79446d0","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"A real system can use an OCR model to recognize the text in an image. Here a stand-in function takes the OCR output directly."}]},{"type":"codeBlock","attrs":{"id":"3cd7172b-80c7-4154-9a29-8ffb3d652530","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def parse_image_ocr(image_name: str, ocr_text: str):\n    blocks = []\n\n    lines = ocr_text.split(\"\\n\")\n\n    for index, line in enumerate(lines):\n        line = line.strip()\n\n        # Skip blank lines in the OCR output.\n        if not line:\n            continue\n\n        block = DocumentBlock(\n            block_type=\"image_text\",\n            content=line,\n            page=None,\n            position={\n                \"line\": index + 1\n            },\n            metadata={\n                \"source_type\": \"image\",\n                \"image_name\": image_name\n            }\n        )\n\n        blocks.append(block.to_dict())\n\n    return 
blocks\n"}]},{"type":"paragraph","attrs":{"id":"0152f1e6-49a5-4b75-ac5c-692c3989aa21","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Image parsing brings screenshots, scanned files, and flowchart annotations into the AI knowledge base."}]},{"type":"paragraph","attrs":{"id":"1b7909f2-908c-44ae-ad0b-18bcdfac1317","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"This matters a great deal for archived enterprise records, scanned contracts, and product screenshots."}]},{"type":"horizontalRule","attrs":{"id":"0a604489-8194-43c9-ac56-1557fc6a9616","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"ab0cd674-eb92-4a11-9966-1415831e02d8","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"6. Content Cleaning: Filtering Low-Quality Blocks"}]},{"type":"paragraph","attrs":{"id":"a80019cc-4a9a-40b5-860e-208a358491e7","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The fifth step is content cleaning."}]},{"type":"paragraph","attrs":{"id":"ea7d64d2-6828-49eb-ae4f-ac03430da6cf","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Not everything the parsers produce belongs in the knowledge base. Content that is too short, duplicated, or meaningless needs to be filtered out."}]},{"type":"codeBlock","attrs":{"id":"f89384bd-0866-48bc-9824-83e5e225bd08","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def clean_blocks(blocks):\n    cleaned = []\n    seen = set()\n\n    for block in blocks:\n        content = block[\"content\"].strip()\n\n        # Drop fragments shorter than 5 characters.\n        if len(content) < 5:\n            continue\n\n        content_hash = hashlib.md5(\n            content.encode(\"utf-8\")\n        ).hexdigest()\n\n        # Drop exact duplicates of content already kept.\n        if content_hash in seen:\n            continue\n\n        seen.add(content_hash)\n\n        block[\"content\"] = content\n        cleaned.append(block)\n\n    return 
cleaned\n"}]},{"type":"paragraph","attrs":{"id":"b6df85d9-36b9-4a26-ba29-ebdcda938111","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Cleaning is an important step before multimodally parsed content enters the knowledge base."}]},{"type":"paragraph","attrs":{"id":"fbdf32d9-bc5d-48f4-af3b-18a4e404bbd8","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Without it, the vector store fills up with duplicate and meaningless fragments."}]},{"type":"horizontalRule","attrs":{"id":"436461d1-26d7-4a9c-9925-2e9972c59644","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"983d7c52-196d-4569-a595-8715c6c71b7a","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"7. Building Knowledge Chunks: A Unified Output Format"}]},{"type":"paragraph","attrs":{"id":"8d8d633b-4a35-40fb-9a42-eb14dd8e842a","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The sixth step turns document blocks into knowledge chunks."}]},{"type":"paragraph","attrs":{"id":"40409624-df65-461c-a435-83152e81389d","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Each chunk keeps the block type, source type, page number, and content so that citation details can be shown at retrieval time."}]},{"type":"codeBlock","attrs":{"id":"d62dad51-5fdd-41fd-9e8a-d63c4f9e4358","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def build_knowledge_chunks(blocks):\n    chunks = []\n\n    for index, block in enumerate(blocks):\n        # Reuse the block ID as the chunk ID so a chunk traces back to its block.\n        chunk = {\n            \"chunk_id\": block[\"block_id\"],\n            \"chunk_index\": index,\n            \"text\": block[\"content\"],\n            \"block_type\": block[\"block_type\"],\n            \"source_type\": block[\"metadata\"].get(\"source_type\"),\n            \"page\": block.get(\"page\"),\n            \"metadata\": block[\"metadata\"],\n            \"created_at\": datetime.now().isoformat()\n        }\n\n        chunks.append(chunk)\n\n    return 
chunks\n"}]},{"type":"paragraph","attrs":{"id":"3f9e317b-d60a-4ef4-86e9-82f4ac3a79f0","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Unified knowledge chunks can go straight into a RAG system."}]},{"type":"paragraph","attrs":{"id":"77583814-af76-479b-bbf2-72623f6ccc6a","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Whether the original data came from a PDF, an image, or a table, it can all be used through the same retrieval interface."}]},{"type":"horizontalRule","attrs":{"id":"7d879dc4-878c-41fd-8967-c24cecc3a6f8","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"c8441a93-6576-49e4-a224-dfb8a40334c3","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"8. Generating a Parsing Report"}]},{"type":"paragraph","attrs":{"id":"263d68bb-525a-4ac8-8972-18b2b94f477a","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The seventh step generates a parsing report."}]},{"type":"paragraph","attrs":{"id":"750f2435-9f70-40ee-8b52-88a1c748a406","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The report counts document blocks by type, which helps gauge the quality of a parsing run."}]},{"type":"codeBlock","attrs":{"id":"b8c7d874-eb53-4e99-82c6-42e82dd46668","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"def generate_parse_report(chunks):\n    type_count = {}\n    source_count = {}\n\n    for chunk in chunks:\n        block_type = chunk[\"block_type\"]\n        source_type = chunk[\"source_type\"]\n\n        # Tally chunks per block type and per source type.\n        type_count[block_type] = type_count.get(block_type, 0) + 1\n        source_count[source_type] = source_count.get(source_type, 0) + 1\n\n    return {\n        \"report_name\": \"多模态文档解析报告\",\n        \"total_chunks\": len(chunks),\n        \"block_type_count\": type_count,\n        \"source_type_count\": source_count,\n        \"sample_chunks\": chunks[:5],\n        \"generate_time\": 
datetime.now().isoformat()\n    }\n"}]},{"type":"paragraph","attrs":{"id":"635b9742-f0a3-4136-a2ac-a4f83dc5b3cf","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"The parsing report helps the team judge whether documents were processed correctly."}]},{"type":"paragraph","attrs":{"id":"62e43c7c-3532-437c-a524-ba5f55645db9","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"If table or image content is missing from the report, the parsing strategy needs to be adjusted."}]},{"type":"horizontalRule","attrs":{"id":"c2f0a53a-7157-483e-92e0-c05748a19004","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"e70938bb-badb-4937-a332-f0d8fb2a6286","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"9. Running the Example: Parsing a PDF, a Table, and an Image"}]},{"type":"paragraph","attrs":{"id":"e4dff471-3516-45c0-bf6b-7b00d8ff2d9e","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Finally, here is a complete example."}]},{"type":"codeBlock","attrs":{"id":"11bd0e40-0f11-4721-a523-351f7443c520","language":"javascript","theme":"atom-one-dark","runtimes":0,"isHoverDragHandle":false,"key":"","languageByAi":"javascript"},"content":[{"type":"text","text":"if __name__ == \"__main__\":\n    # Sample text stays in Chinese because the title heuristic in\n    # parse_pdf_pages checks for the Chinese full stop.\n    pdf_pages = [\n        \"\"\"\n        企业 AI 知识库建设方案\n        本方案介绍 RAG 系统的数据接入、内容清洗和向量检索流程。\n        系统需要支持 PDF、图片和表格等多种数据来源。\n        \"\"\",\n        \"\"\"\n        技术架构\n        数据进入系统后,需要先解析为统一文档块。\n        随后进行清洗、切分、向量化和索引构建。\n        \"\"\"\n    ]\n\n    # Columns: metric, value, note.\n    table_headers = [\"指标\", \"数值\", \"说明\"]\n\n    table_rows = [\n        [\"解析文档数\", \"120\", \"本周新增解析文档\"],\n        [\"失败数量\", \"3\", \"主要由图片质量过低导致\"],\n        [\"平均耗时\", \"2.3秒\", \"单文档平均解析时间\"]\n    ]\n\n    # Simulated OCR output for a flowchart image.\n    image_text = \"\"\"\n    系统流程图\n    上传文档 -> 内容解析 -> 向量入库 -> 智能问答\n    \"\"\"\n\n    all_blocks = []\n\n    all_blocks.extend(\n        parse_pdf_pages(pdf_pages)\n    )\n\n    all_blocks.extend(\n        parse_table(\n            table_name=\"文档解析统计表\",\n            headers=table_headers,\n            rows=table_rows\n        )\n    )\n\n    all_blocks.extend(\n        parse_image_ocr(\n            image_name=\"architecture.png\",\n            ocr_text=image_text\n        )\n    )\n\n    # Clean, convert to chunks, then summarize the run.\n    cleaned_blocks = clean_blocks(all_blocks)\n    chunks = build_knowledge_chunks(cleaned_blocks)\n    report = 
generate_parse_report(chunks)\n\n    print(json.dumps(\n        report,\n        ensure_ascii=False,\n        indent=2\n    ))\n"}]},{"type":"horizontalRule","attrs":{"id":"4e9e5c77-0954-426f-be22-80ce380852c5","isHoverDragHandle":false}},{"type":"heading","attrs":{"id":"11ab2b05-e2d8-48d2-8fe5-757cbe838489","textAlign":"inherit","indent":0,"level":2,"isHoverDragHandle":false},"content":[{"type":"text","text":"10. Outlook"}]},{"type":"paragraph","attrs":{"id":"10bcfb5e-3899-4fa6-b8a2-188ef81cce5a","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"As this pipeline shows, enterprise AI knowledge bases are moving from text processing toward multimodal parsing."}]},{"type":"paragraph","attrs":{"id":"099de01e-cc94-4e26-8d08-2ad16e1527c5","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"RAG systems used to handle mainly web pages and document body text; going forward, PDFs, images, tables, screenshots, scanned files, and slide decks will all become knowledge sources."}]},{"type":"paragraph","attrs":{"id":"28003978-4ca9-4f67-8789-b18c1f86a6fe","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"This lets AI systems cover far more enterprise material."}]},{"type":"paragraph","attrs":{"id":"d77f98d3-75be-4a02-b913-959cf65a3300","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"At the same time, parsing quality becomes more important. Whether document structure is preserved, whether table relationships are correct, and whether image text is recognized accurately all affect the final answer quality."}]},{"type":"paragraph","attrs":{"id":"b7930254-0be5-41a5-87bc-14c912bdaab6","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Competition in enterprise AI will be about data parsing capability as well as model capability."}]},{"type":"paragraph","attrs":{"id":"69efa640-e6c0-40de-983f-6698dd08100a","textAlign":"inherit","indent":0,"color":null,"background":null,"isHoverDragHandle":false},"content":[{"type":"text","text":"Whoever can reliably turn complex documents into retrievable knowledge will be the one to bring large language models into the enterprise knowledge system."}]}]}","createTime":1782902777,"ext":{"closeTextLink":0,"comment_ban":0,"description":"","focusRead":0},"favNum":0,"html":"","isOriginal":0,"likeNum":0,