Abstract
Web page deduplication is an effective way to improve web retrieval. To address the shortcomings of existing deduplication algorithms and to exploit the structural features of web page body text, this paper proposes a deduplication algorithm based on logical paragraphs and long-sentence extraction. Using the user's retrieval keywords, the method represents the physical paragraph structure of a page's body text as logical paragraphs, then extracts the long sentences from those logical paragraphs as the page's feature code for judging whether pages are similar. Experiments show that the method improves deduplication of short mirror pages and near-mirror pages.
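The pipeline outlined in the abstract (keyword-driven logical paragraphs, long-sentence feature codes, similarity judgment) can be illustrated with a minimal Python sketch. The abstract does not give the grouping rule, the long-sentence criterion, or the similarity measure, so everything below is an assumption made for illustration: a paragraph containing a keyword starts a new logical paragraph, the single longest sentence per logical paragraph forms the feature code, and exact-match Jaccard overlap with a threshold of 0.5 stands in for the paper's sentence similarity. All function names are hypothetical.

```python
import re


def split_sentences(text):
    # Naive split on Chinese and Western sentence terminators
    # (a simplification: it also splits on decimals and abbreviations).
    parts = re.split(r"[。！？!?.]+", text)
    return [p.strip() for p in parts if p.strip()]


def logical_paragraphs(paragraphs, keywords):
    # ASSUMED rule, not taken from the paper: a physical paragraph that
    # contains any keyword opens a new logical paragraph; paragraphs
    # without keywords are attached to the current logical paragraph.
    groups = []
    for p in paragraphs:
        if any(k in p for k in keywords) or not groups:
            groups.append([p])
        else:
            groups[-1].append(p)
    return groups


def feature_code(paragraphs, keywords, per_group=1):
    # ASSUMED criterion: keep the `per_group` longest sentences of each
    # logical paragraph as the page's feature code.
    code = set()
    for group in logical_paragraphs(paragraphs, keywords):
        sentences = [s for p in group for s in split_sentences(p)]
        code.update(sorted(sentences, key=len, reverse=True)[:per_group])
    return code


def jaccard(a, b):
    # Exact-match set overlap, a stand-in for sentence similarity.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(page_a, page_b, keywords, threshold=0.5):
    # The threshold value is an arbitrary illustrative choice.
    code_a = feature_code(page_a, keywords)
    code_b = feature_code(page_b, keywords)
    return jaccard(code_a, code_b) >= threshold
```

A page is passed as a list of physical paragraph strings, for example `is_duplicate(page_a, page_b, ["search"])`. Because short boilerplate such as footers rarely contains the longest sentence of a logical paragraph, two mirror pages that differ only in such boilerplate yield the same feature code under these assumptions.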
Source
《图书情报研究》
2012, No. 2, pp. 41-45 (5 pages)
Library and Information Studies
Keywords
detection and elimination of similar web pages
logical paragraphs
extraction of long sentences
sentence similarity