Abstract
With the rapid development of Internet technology, the amount of information carried on the web has grown explosively, and the volume of web log data has grown with it. Storing and processing data at this scale has become a new challenge, and cloud computing offers one way to address it: data is distributed over the network to the computing nodes of a cluster, where large-scale storage and computation are carried out. Hadoop is a popular open-source framework for building cloud computing platforms and is widely used for massive data processing. To process data with Hadoop, however, users must write their own Map/Reduce programs, which operate at a relatively low level and are therefore hard to master and to maintain. Hive is an open-source data warehouse tool built on Hadoop; it maps files to database tables and provides SQL-like statements, which simplifies development. Using Hadoop and Hive, a web log analysis system was designed that takes full advantage of Hadoop's capacity for processing massive data while reducing the difficulty of development. A comparison with a stand-alone experiment shows that the system is effective and valuable.
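As a rough illustration of the SQL-like interface the abstract refers to, the following HiveQL sketch maps raw web log files stored in HDFS onto a table and runs a simple page-view count. The table name, column layout, delimiter, and HDFS path are hypothetical and not taken from the paper.

    -- Map raw web log files in HDFS onto a Hive table (schema and path are assumed)
    CREATE EXTERNAL TABLE IF NOT EXISTS web_log (
        host         STRING,
        request_time STRING,
        method       STRING,
        url          STRING,
        status       INT,
        bytes_sent   BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/weblogs';

    -- A typical log-analysis query: the ten most requested URLs
    SELECT url, COUNT(*) AS hits
    FROM web_log
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;

Hive compiles statements like these into Map/Reduce jobs automatically, which is the basis of the reduction in development effort described above.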
Source
《广西大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journal (北大核心)
2011, Issue A01, pp. 314-317 (4 pages)
Journal of Guangxi University (Natural Science Edition)