Abstract
Most existing cross-modal retrieval methods use only the intra-modal relationships within each modality or the inter-modal relationships between image regions and text words. This paper proposes a vision-language model based on the syntactic dependencies of natural language, called Dep-ViLT. Through syntactic dependency analysis, a syntactic dependency tree is constructed, and the one-directional dependency relations are used to enhance the feature representation of the core semantics and to promote feature interaction between the language and vision modalities. Experiments show that Dep-ViLT improves recall (R@K) over existing SOTA models by 1.7% on average, with a maximum gain of 2.2%. Most importantly, Dep-ViLT still performs well on long, difficult sentences with complex grammatical structures.
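The abstract describes using one-directional dependency relations to pass dependent-word features toward their syntactic heads. The actual Dep-ViLT architecture is a transformer and is not specified here; the following is only a minimal sketch of the one-directional (child-to-head) aggregation idea on toy feature vectors, assuming a pre-computed dependency parse (the function name, `alpha` mixing weight, and toy features are all illustrative assumptions):

```python
# Illustrative sketch only: mimic one-directional (child -> head)
# dependency aggregation on toy per-token feature vectors,
# assuming the dependency parse is already available.

def aggregate_toward_heads(features, heads, alpha=0.5):
    """Propagate each token's features one step up the dependency tree.

    features: list of per-token feature vectors (lists of floats)
    heads:    heads[i] is the index of token i's head (-1 for the root)
    alpha:    mixing weight for incoming child features (assumed value)
    """
    enhanced = [vec[:] for vec in features]  # copy: aggregation is one step
    for i, h in enumerate(heads):
        if h == -1:
            continue  # the root has no head to send features to
        for d in range(len(features[i])):
            enhanced[h][d] += alpha * features[i][d]
    return enhanced

# Toy parse of "a dog chases a ball"; all arcs point toward the verb.
tokens = ["a", "dog", "chases", "a", "ball"]
heads = [1, 2, -1, 4, 2]   # det->dog, nsubj->chases, root, det->ball, dobj->chases
feats = [[1.0], [2.0], [3.0], [1.0], [2.0]]

out = aggregate_toward_heads(feats, heads)
# The root verb "chases" (the core semantics) accumulates its dependents' features.
```

Because the arcs are directed, only head words are enhanced; leaf tokens such as the determiners keep their original features, which matches the paper's goal of strengthening the core-semantic words.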
Authors
ZHANG Zhiqi; YUAN Xinpan; ZENG Zhigao (Hunan University of Technology, Zhuzhou 412007, China)
Source
Modern Information Technology (《现代信息科技》), 2023, No. 10, pp. 74-79 (6 pages)
Funding
Scientific Research Project of the Hunan Provincial Department of Education, 2022 (22B0559)
General Project of the Natural Science Foundation of Hunan Province, 2022 (2022JJ30231)
Intergovernmental International Science and Technology Innovation Cooperation Program (2022YFE0103700)
Postgraduate Research and Innovation Project of Hunan University of Technology (CX2213)