A Survey of Web Page Cleaning Research (网页去噪:研究综述)

Cited by: 18
Abstract: The rapid development of the Internet has made Web page data one of the most important data sources for a wide range of applications and research. A Web page contains many kinds of content, such as advertisements, navigation bars, related links, and the main text. For any particular study or application, however, not all of this content is needed; on the contrary, irrelevant content degrades both the effectiveness and the efficiency of that work. Web page cleaning is therefore a fundamental problem and an active research topic in information retrieval, particularly as search engines have grown, and a survey of the field is needed to support further in-depth study. This paper first explains why Web page cleaning is necessary and introduces its definition and related concepts. It then presents a classification hierarchy of Web page cleaning methods, divided into single-model based and multi-model based methods, and summarizes a variety of cleaning techniques and frameworks, including SST, Shingle, Pagelet, and DSE. Next, it describes the experimental datasets and evaluation methods used for Web page cleaning algorithms. Finally, it discusses open problems and future research directions in the field.
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2010, No. 12, pp. 2025-2036 (12 pages).
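Funding: National Natural Science Foundation of China (70903008, 60933004); National High Technology Research and Development Program of China, the "863" Program (2007AA01Z154, 2009AA01Z143); CNCI Search Engine Project (CNCI2008-122).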
Keywords: Web page cleaning; data mining; Web mining; information retrieval; WWW
