Abstract
Investigation reports are an important basis for engineering design, yet much of the table and text information they contain is not effectively identified or used. To break down the data barriers that hinder professional software development, this information urgently needs to be identified and extracted. This paper designs algorithms for extracting table and text information from Word documents and proposes a complete solution for extracting, displaying, and using that information. Using a file reading and writing library, the Word tables are traversed and the row and column spans of each cell are calculated, so that Word tables are reproduced accurately in Excel. Using document automation technology, the range of each Word table is recorded and its title is obtained by searching backward from the table. Using a stack data structure and a matching algorithm, the Word paragraphs are traversed for outline matching and range calculation, which identifies the outline structure of the text. The data are presented in the software interface by simulating copy-and-paste operations in the background. Multi-threading is introduced to keep information extraction from blocking the main thread, and parallel analysis is introduced to speed up text analysis, improving the overall user experience of the software. Finally, the applicability and accuracy of the algorithm are verified on a real engineering investigation report.
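The stack-based outline matching and range calculation mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the paragraphs have already been read into a list of strings, that headings follow a hypothetical decimal numbering such as `1`, `1.2`, `1.2.3`, and that a section ends just before the next heading of the same or a higher level. The names `HEADING`, `Section`, and `outline_ranges` are illustrative only.

```python
import re
from dataclasses import dataclass

# Assumed heading pattern (hypothetical): "1", "1.2", "1.2.3" followed by a title.
# Body paragraphs that happen to start with a number would also match, so a real
# report would likely need a stricter check (for example, the paragraph style).
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Section:
    number: str   # outline number, e.g. "1.2"
    title: str    # heading text without the number
    level: int    # 1 for "1", 2 for "1.2", and so on
    start: int    # index of the heading paragraph
    end: int = -1  # index of the last paragraph in the section (inclusive)


def outline_ranges(paragraphs):
    """Return one Section per heading, with its paragraph range filled in."""
    sections = []
    stack = []  # headings whose sections are still open, outermost first
    for i, text in enumerate(paragraphs):
        m = HEADING.match(text.strip())
        if not m:
            continue  # ordinary body paragraph
        level = m.group(1).count(".") + 1
        # A new heading closes every open section at the same or a deeper level.
        while stack and stack[-1].level >= level:
            stack.pop().end = i - 1
        sec = Section(m.group(1), m.group(2), level, i)
        sections.append(sec)
        stack.append(sec)
    # Sections still open at the end of the document run to the last paragraph.
    while stack:
        stack.pop().end = len(paragraphs) - 1
    return sections
```

Each paragraph is visited once and each heading is pushed and popped at most once, so the pass is linear in the number of paragraphs. The stack keeps exactly the chain of enclosing headings, which is what makes the range of a parent section cover all of its subsections.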
Author
Li Hao (李浩)
Engineering Research Center of Railway Industry on Digital and Intelligent Survey and Design System, China Railway Siyuan Survey and Design Group Co., Ltd., Wuhan, Hubei 430063, China; Digital Intelligence Business Unit, China Railway Siyuan Survey and Design Group Co., Ltd., Wuhan, Hubei 430063, China
Source
Railway Technical Standard (Chinese & English) (《铁道技术标准(中英文)》)
2024, No. 3, pp. 39-46 (8 pages)
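Funding
National Key Research and Development Program of China (2021YFB2600400)
Science and Technology Research and Development Program of China Railway Construction Corporation Limited (2022-A02)
Science and Technology Research and Development Project of China Railway Siyuan Survey and Design Group Co., Ltd. (2022D001)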
Keywords
algorithm
table information extraction
text information extraction
multi-thread