Abstract
Search engines are the entry point for obtaining information on the Internet, so retrieving information efficiently and accurately is essential. As the upstream component of a search engine, the crawler is correspondingly important. In the era of big data, information is updated frequently, and how quickly news can be fetched after publication is a key factor in crawler timeliness. To make full use of limited resources and improve bandwidth utilization, a crawler scheduling algorithm based on historical data prediction is designed. The algorithm accumulates data on each website's historical update frequency through crawling, builds a model with random forest regression, and implements crawler scheduling in the system. Experimental results show that the strategy increases the hit rate for crawling new links by 46%, reduces the average cost by 11%, and reduces the average crawl delay by 14%.
Author
韩瑞昕
HAN Rui-xin (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China)
Source
《软件导刊》
2020, Issue 1, pp. 108-112 (5 pages)
Software Guide
Keywords
search engine
crawler scheduling
regression prediction
random forest