Abstract
This paper discusses data preparation for Web usage mining and proposes an improved algorithm for session identification in the preprocessing stage. The method constructs sessions from three factors: (1) a session-duration threshold set from prior knowledge; (2) a threshold on the interval between consecutive page requests, derived from the statistical distribution of page access times; and (3) page importance, determined from page content and site structure. Experimental results show that, compared with traditional approaches that rely on a single criterion, the method identifies sessions more accurately and is more reasonable and effective.
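The abstract does not give the algorithm itself, so the following is only a minimal sketch of how three such criteria might be combined. Every name in it (`Hit`, `split_sessions`, `session_limit`, `gap_limit`, `importance`) is hypothetical, and the default values of 1800 s and 600 s are illustrative assumptions rather than the paper's thresholds. In particular, scaling the allowed gap by a per-page importance weight is one plausible reading of the third factor, not the authors' stated rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Hit:
    """One page request by a single user (hypothetical record type)."""
    url: str
    time: float  # request time in seconds


def split_sessions(
    hits: List[Hit],
    session_limit: float = 1800.0,  # assumed cap on total session duration
    gap_limit: float = 600.0,       # assumed cap on the gap between requests
    importance: Optional[Dict[str, float]] = None,  # assumed per-page weight
) -> List[List[Hit]]:
    """Split one user's time-ordered hits into sessions.

    A new session starts when either the session would exceed
    session_limit, or the gap since the previous page exceeds
    gap_limit scaled by that page's importance weight (default 1.0).
    """
    weights = importance or {}
    sessions: List[List[Hit]] = []
    current: List[Hit] = []
    for hit in hits:
        if current:
            prev = current[-1]
            # Assumption: a more important page tolerates a longer stay.
            allowed_gap = gap_limit * weights.get(prev.url, 1.0)
            too_long = hit.time - current[0].time > session_limit
            too_idle = hit.time - prev.time > allowed_gap
            if too_long or too_idle:
                sessions.append(current)
                current = []
        current.append(hit)
    if current:
        sessions.append(current)
    return sessions
```

For example, calling `split_sessions(hits, importance={"/article": 2.0})` would let a user stay on the hypothetical page `/article` for up to twice `gap_limit` before the next request is treated as the start of a new session.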
Source
《科技广场》 (Science Mosaic)
2008, No. 7, pp. 85-87 (3 pages)
Keywords
Web Log Data Mining
Data Preparation
Session Identification
Threshold
Website Structure