Abstract
In recent years, joint topic sentiment models have become an important research topic in unsupervised learning, with practical applications in text topic mining and sentiment analysis. Weibo posts, however, are short and structurally incomplete, which poses challenges for such models. This paper therefore studies and improves topic sentiment models for Weibo. We introduce word vectors into TSMMF (Topic Sentiment Model Based on Multi-feature Fusion), a currently popular topic sentiment model: a multivariate Gaussian distribution is used to quickly sample neighboring words from the word embedding space, and these replace the words generated by the original Dirichlet-multinomial distribution. In this way, words with low co-occurrence frequency and little information are converted into words that highlight the topic and carry clear information. A nearest neighbor search algorithm is also used to further speed up the model on large Weibo corpora. The resulting model is called GWE-TSMMF. Comparative experiments show that GWE-TSMMF achieves an average F1 value of about 0.718, and that its Weibo sentiment polarity analysis improves significantly over both the original model and existing mainstream word-embedding topic sentiment models (WS-TSWE and HST-SCW).
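The word-replacement step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the Gaussian parameters or the nearest neighbor algorithm, so the isotropic covariance, the `cov_scale` parameter, the brute-force Euclidean search, and the function name `sample_neighbor_word` are all assumptions made for illustration.

```python
import numpy as np

def sample_neighbor_word(embeddings, word_id, cov_scale=0.05, rng=None):
    """Return the id of a word near `word_id` in embedding space.

    Assumptions (not taken from the paper): the Gaussian is centred on the
    word's own vector with covariance cov_scale * I, and the nearest
    neighbor is found by exhaustive Euclidean search.
    """
    rng = np.random.default_rng() if rng is None else rng
    dim = embeddings.shape[1]
    mean = embeddings[word_id]
    cov = cov_scale * np.eye(dim)
    # Draw one point from the multivariate Gaussian around the word vector.
    sampled = rng.multivariate_normal(mean, cov)
    # Map the sampled point back to the vocabulary: closest embedding wins.
    distances = np.linalg.norm(embeddings - sampled, axis=1)
    return int(np.argmin(distances))

# Toy vocabulary of three 2-dimensional word vectors (hypothetical data).
toy_embeddings = np.array([[0.0, 0.0], [10.0, 10.0], [0.1, 0.0]])
replacement = sample_neighbor_word(toy_embeddings, 0, rng=np.random.default_rng(0))
```

For a large vocabulary, the exhaustive search in this sketch would presumably be replaced by an approximate or tree-based nearest neighbor index, which is the role the abstract assigns to the nearest neighbor search algorithm.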
Authors
李玉强
张伟江
黄瑜
李琳
刘爱华
LI Yu-qiang; ZHANG Wei-jiang; HUANG Yu; LI Lin; LIU Ai-hua (School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430063, China; School of Energy and Power Engineering, Wuhan University of Technology, Wuhan 430063, China)
Source
《计算机科学》
CSCD
Peking University Core Journal
2022, Issue 2, pp. 256-264 (9 pages)
Computer Science
Funding
National Social Science Fund of China (15BGL048).
Keywords
Topic sentiment model
Gaussian distribution
Word embedding
Weibo sentiment polarity analysis