Abstract
Real-time ETL (Extract-Transform-Load) breaks with the offline batch-processing model of the traditional data warehouse: it uses a real-time stream-processing strategy to send changed data to the target warehouse. This study aims to reduce the processing latency of the ETL pipeline so that the source and the target become consistent within a short time. The ETL pipeline is studied on Storm, a pure stream-processing framework. Change data capture (CDC) is a key step in ETL; on massive data, the traditional snapshot-based change-capture method has high latency, which hinders real-time ETL. Storm uses round-robin scheduling by default, which ignores both the network communication overhead between worker nodes and the load balance of the cluster. To address the high latency of traditional change capture, a change-data-mark capture algorithm (CDMC) is proposed. To address the shortcomings of Storm's default scheduling, a Storm scheduling algorithm based on a non-cooperative game (Game-Storm) is proposed. Storm extracts the changed data from the source through its Spout component and hands it to the Bolt logic-processing component, after which the data is loaded into the target warehouse. Combining the mark-capture strategy with the game-based scheduling strategy yields an optimization strategy for the ETL pipeline (GS-M-ETL). Experimental analysis shows that the new method reduces ETL processing latency by 29.5%.
Authors
MA Hai-xu (马海旭); FENG Xin (冯欣); WANG Gui-lei (王贵磊); SUN Kai-wei (孙开蔚)
School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022
Source
Journal of Changchun University of Science and Technology (Natural Science Edition) (《长春理工大学学报(自然科学版)》)
2021, No. 5, pp. 93-102 (10 pages)
Funding
National Key Research and Development Project (2017YFB1401800).
Keywords
ETL
change capture
scheduling
communication overhead
load balancing