Abstract
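In recent years, micro-blogging services such as Twitter and Sina Weibo have become popular worldwide. Based on the characteristics of micro-blogs, this paper proposes a new method for discovering and ranking topics related to a specific subject (for example, a product). First, micro-blogs that contain the terms of interest are collected from the Internet and formatted. Every word in these micro-blogs is then assessed for importance by jointly considering three factors: influence, burstiness, and relevance. Next, once word importance has been estimated, the set of micro-blogs containing the same keyword is used as an input document to train an LDA model. Words are then clustered and topics are mined by deriving the probability distribution of the topic keywords. This method overcomes the data sparsity problem caused by the length limit of micro-blogs. Finally, experiments on real-world datasets demonstrate the effectiveness of the method.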
Micro-blogging services such as Twitter and Sina Weibo are becoming popular across the world. This paper proposes a new approach for extracting from micro-blogs what people think about a product, a company, or an organization. First, messages that mention the item of interest (e.g., a product) are collected and formalized. Then, the keywords co-occurring with that item are analyzed to estimate their importance. In this procedure, three factors (influence, burstiness, and relevance) are considered in order to balance the novelty and specificity of topics. The influence score of a keyword is based on its probability of being viewed by many people, the burstiness score is based on whether it has appeared more often recently than before, and the relevance score is based on its co-occurrence relationship with the product of interest. After the keyword ranking process, micro-blogs containing the same keyword are aggregated into a term profile that serves as input for training an LDA model, which mitigates the data sparsity caused by the length limit of micro-blogs. The validity of the approach is demonstrated in a real-world case study.
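The aggregation step described in the abstract, in which micro-blogs sharing a keyword are merged into one term profile, can be illustrated with a minimal sketch. The function name `build_term_profiles`, the pre-tokenized input, and the sample posts are assumptions made for illustration only; the abstract does not specify the authors' implementation, and the scoring of influence, burstiness, and relevance is not reproduced here.

```python
from collections import defaultdict


def build_term_profiles(posts, keywords):
    """Merge tokenized posts into one pseudo-document per keyword.

    posts: list of token lists, one per micro-blog (tokenization assumed done).
    keywords: keywords kept after ranking (assumed to be given).
    Returns a dict mapping each keyword to its term profile (a token list).
    """
    keyword_set = set(keywords)
    profiles = defaultdict(list)
    for tokens in posts:
        # A post contributes to the profile of every keyword it mentions.
        for kw in keyword_set.intersection(tokens):
            profiles[kw].extend(tokens)
    return dict(profiles)


# Hypothetical sample data, not taken from the paper.
posts = [
    ["phone", "battery", "drains", "fast"],
    ["battery", "replacement", "price"],
    ["phone", "screen", "cracked"],
]
profiles = build_term_profiles(posts, ["battery", "screen"])
```

Each resulting profile is much longer than a single micro-blog, which is why such profiles are a more suitable input document for a topic model than the short posts themselves.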
Source
《计算机研究与发展》
EI
CSCD
Peking University Core Journals (北大核心)
2013, Issue S1, pp. 179-185 (7 pages)
Journal of Computer Research and Development
Funding
National High Technology Research and Development Program of China ("863" Program) (2012AA040911)
Keywords
micro-blog (微博客)
keyword ranking (关键词排序)
topic detection (主题发现)
latent Dirichlet allocation (LDA)
topic model (主题模型)
text mining (文本挖掘)