
Image caption in Chinese with vision-union grouping
Abstract To address the problem that the encoders used in image captioning extract insufficient fine-grained semantic features from the input image, which leads to coarse captions that lack textual detail, an image captioning model for Chinese with vision-union grouping is proposed. The model follows the encoder-decoder framework. In the encoding stage, two kinds of features, global semantics and local details, are extracted in two ways. First, a Contrastive Language-Image Pre-Training (CLIP) image encoder extracts the latent semantic information of the image. Second, following the idea of visual grouping, the object categories in the image are divided into visual segments of different regular sizes, from which the detail features of the image are extracted. Finally, the two kinds of features are fused and converted by a mapping network into prefix embeddings for the caption text, which are then fed into the language model. In the decoding stage, the language model GPT-2 generates the image caption. Simulation experiments were conducted on the AIC-ICC dataset. Compared with the models in the related literature, the proposed model achieves the best performance, scoring 0.815, 0.711, 0.616, and 0.532 on BLEU-1 through BLEU-4, respectively, and the results show that it generates more accurate and more fluent captions.
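The encoding pipeline described in the abstract (global feature, local segment features, fusion, mapping to a prefix) can be illustrated with a minimal sketch in plain Python. Everything specific here is an assumption made for illustration and not taken from the paper: the toy dimensions, mean pooling over segments, fusion by concatenation, a single linear layer as the mapping network, and the helper names `make_linear`, `linear`, `mean_pool`, and `build_prefix`. Random vectors stand in for the CLIP embedding, the visual-segment features, and the GPT-2 token embeddings.

```python
import random


def make_linear(rng, in_dim, out_dim):
    """Create a random (weight, bias) pair; stands in for trained parameters."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Compute W x + b for a single vector represented as a Python list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def build_prefix(global_feat, segment_feats, params, prefix_len, embed_dim):
    """Fuse global and local features and map them to prefix embeddings."""
    local_feat = mean_pool(segment_feats)   # assumption: mean pooling over segments
    fused = global_feat + local_feat        # assumption: fusion by concatenation
    weight, bias = params
    flat = linear(fused, weight, bias)      # assumption: one linear mapping layer
    # Split the flat output into prefix_len vectors of size embed_dim.
    return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(prefix_len)]


rng = random.Random(0)
GLOBAL_DIM, LOCAL_DIM, EMBED_DIM, PREFIX_LEN = 8, 6, 4, 3  # toy sizes, not from the paper

# Stand-in for a CLIP image embedding (global semantics).
global_feat = [rng.random() for _ in range(GLOBAL_DIM)]
# Stand-in for features of 5 visual segments (local details).
segment_feats = [[rng.random() for _ in range(LOCAL_DIM)] for _ in range(5)]
params = make_linear(rng, GLOBAL_DIM + LOCAL_DIM, PREFIX_LEN * EMBED_DIM)

prefix = build_prefix(global_feat, segment_feats, params, PREFIX_LEN, EMBED_DIM)

# Stand-in for the embeddings of two caption tokens seen so far.
token_embeds = [[0.0] * EMBED_DIM for _ in range(2)]
# The prefix is placed before the caption tokens in the decoder input.
decoder_input = prefix + token_embeds
print(len(prefix), len(prefix[0]), len(decoder_input))  # prints: 3 4 5
```

In a real system the lists would be tensors, the mapping parameters would be learned, and `decoder_input` would be passed to the language model as input embeddings so that caption generation is conditioned on the image.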
Authors HAO Zixian (郝子娴); WANG Xingjian (汪兴建); YANG You (杨有) (School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China; Chongqing Youth Vocational & Technical College, Chongqing 400712, China; National Center for Applied Mathematics in Chongqing, Chongqing Normal University, Chongqing 401331, China)
Source Microelectronics & Computer (《微电子学与计算机》), 2024, No. 8, pp. 73-80 (8 pages)
Funding Science and Technology Research Program of the Chongqing Municipal Education Commission (KJZD-K202200504, KJQN-202200564); Chongqing Education Science "14th Five-Year" Plan Project (2022-576).
Keywords image captioning in Chinese; visual grouping; feature integration; image semantics; encoding and decoding