Abstract
This paper proposes a resource-aware task scheduling method for distributed crawlers that aims to optimize the use of system resources on each node in a distributed environment and to improve the execution efficiency of crawler tasks. The method combines a resource-aware scheduling algorithm with node priority management: it monitors the CPU, memory, and network resources of each node and uses these measurements to balance task scheduling, so that crawler tasks are dispatched to nodes with lower resource utilization. This mitigates both excessive resource occupation on individual nodes and load imbalance across nodes. In addition, the method uses Flask to improve scalability and to implement a visual crawler monitoring platform. Experimental results show that the proposed method significantly improves the execution efficiency and adaptability of crawler tasks, and it offers useful guidance for further optimization of distributed crawler systems.
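The scheduling idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the `NodeStatus` class, the `WEIGHTS` values, the `threshold` cutoff, and the per-task cost are all hypothetical, since the abstract does not say how the CPU, memory, and network measurements are combined or how node priority is computed. The sketch only shows the general pattern of sending each task to the node with the lowest weighted utilization.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Snapshot of one crawler node; all utilizations are fractions in [0, 1]."""
    name: str
    cpu: float
    memory: float
    network: float


# Hypothetical weights: the abstract does not specify how metrics are combined.
WEIGHTS = {"cpu": 0.4, "memory": 0.3, "network": 0.3}


def load_score(node):
    """Weighted utilization of a node; a lower score means more spare capacity."""
    return (WEIGHTS["cpu"] * node.cpu
            + WEIGHTS["memory"] * node.memory
            + WEIGHTS["network"] * node.network)


def pick_node(nodes, threshold=0.9):
    """Return the least-loaded node, skipping nodes with any metric at or above threshold.

    Returns None when every node is considered overloaded.
    """
    candidates = [n for n in nodes
                  if max(n.cpu, n.memory, n.network) < threshold]
    if not candidates:
        return None
    return min(candidates, key=load_score)


def assign_tasks(tasks, nodes, cost_per_task=0.05):
    """Greedily assign each task to the currently least-loaded node.

    Assumption: every task raises each metric of its node by cost_per_task.
    A real system would re-read measured utilization instead of estimating it.
    """
    assignment = {}
    for task in tasks:
        node = pick_node(nodes)
        if node is None:
            break  # no node has spare capacity; remaining tasks stay unassigned
        assignment[task] = node.name
        node.cpu = min(1.0, node.cpu + cost_per_task)
        node.memory = min(1.0, node.memory + cost_per_task)
        node.network = min(1.0, node.network + cost_per_task)
    return assignment
```

Calling `assign_tasks(["t1", "t2", "t3"], nodes)` with a list of `NodeStatus` objects returns a dictionary mapping each task to a node name. Because the estimated load of a node rises after each assignment, consecutive tasks tend to spread across nodes instead of piling onto one.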
Authors
ZHANG Jun (张军)
WEI Jizhen (魏继桢)
LI Yubin (李钰彬)
School of Information Engineering, East China University of Technology, Nanchang 330013, China
Source
Modern Electronics Technique (《现代电子技术》)
Peking University Core Journal (北大核心)
2024, Issue 9, pp. 86-90 (5 pages)
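Funding
National Natural Science Foundation of China (62162002)
National Natural Science Foundation of China (61662002)
Natural Science Foundation of Jiangxi Province (20212BAB202002)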
Keywords
distributed crawler
task scheduling
resource awareness
Flask
data collection
resource utilization rate