
Detection and Elimination of Similar Web Pages Based on Logical Paragraphs and Extraction of Long Sentences (original title: 基于网页正文逻辑段落和长句提取的网页去重算法). Cited by: 1
Abstract: Detecting and eliminating similar web pages (web page deduplication) is an effective way to improve web retrieval. To address the shortcomings of existing deduplication algorithms and to exploit the structural features of web page body text, this paper proposes a deduplication algorithm based on logical paragraphs and the extraction of long sentences. Using the user's search keywords, the method maps the physical paragraph structure of a page's body text to logical paragraphs. It then extracts long sentences from those logical paragraphs as the page's feature code, which is used to judge whether two pages are similar. Experimental results show that the method improves deduplication of short mirror pages and near-mirror pages.
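The abstract gives only the outline of the method, so the following is a minimal Python sketch of that outline under stated assumptions, not the authors' implementation. The rule for forming logical paragraphs (merging adjacent physical paragraphs that contain a search keyword), the choice of the single longest sentence per logical paragraph, the use of `difflib.SequenceMatcher` for sentence similarity, and both thresholds are assumptions made for illustration; all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Sentence delimiters for Chinese and English text (a simplifying assumption).
SENTENCE_END = re.compile(r"[。!?!?.\n]+")


def logical_paragraphs(paragraphs, keywords):
    """Merge runs of adjacent physical paragraphs that contain a search keyword.

    Assumption: paragraphs with no keyword are dropped and act as boundaries.
    """
    groups, current = [], []
    for paragraph in paragraphs:
        if any(keyword in paragraph for keyword in keywords):
            current.append(paragraph)
        elif current:
            groups.append("\n".join(current))
            current = []
    if current:
        groups.append("\n".join(current))
    return groups


def long_sentences(paragraph, per_paragraph=1):
    """Return the longest sentence(s) of one logical paragraph."""
    sentences = [s.strip() for s in SENTENCE_END.split(paragraph) if s.strip()]
    return sorted(sentences, key=len, reverse=True)[:per_paragraph]


def feature_code(paragraphs, keywords):
    """Build a page's feature code: long sentences from each logical paragraph."""
    return [
        sentence
        for group in logical_paragraphs(paragraphs, keywords)
        for sentence in long_sentences(group)
    ]


def is_duplicate(code_a, code_b, sent_threshold=0.8, page_threshold=0.6):
    """Judge two pages similar if enough feature sentences have a close match.

    Both thresholds are illustrative values, not taken from the paper.
    """
    if not code_a or not code_b:
        return False
    matched = sum(
        1
        for a in code_a
        if any(SequenceMatcher(None, a, b).ratio() >= sent_threshold for b in code_b)
    )
    return matched / max(len(code_a), len(code_b)) >= page_threshold
```

In this sketch, two pages that share their keyword-bearing body text but differ in surrounding navigation or copyright lines produce the same feature code, which is the situation the abstract describes for short mirror and near-mirror pages.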
Source: Library and Information Studies (《图书情报研究》), 2012, No. 2, pp. 41-45 (5 pages)
Keywords: detection and elimination of similar web pages (web page deduplication); logical paragraphs; extraction of long sentences; sentence similarity