Triple Multimodal Cyclic Fusion and Self-Adaptive Balancing for Video Q&A Systems

Abstract: The performance of Video Question and Answer (VQA) systems relies on capturing key information from both visual images and natural language in context to generate relevant answers to questions. However, traditional linear combinations of multimodal features focus only on shallow feature interactions and fall far short of the need for deep feature fusion. Attention mechanisms have been used to perform deep fusion, but most of them can only assign weights within single-modal information, leading to attention imbalance across modalities. To address these problems, we propose a novel VQA model based on Triple Multimodal feature Cyclic Fusion (TMCF) and a Self-Adaptive Multimodal Balancing Mechanism (SAMB). Our model is designed to enhance complex interactions among multimodal features with cross-modal information balancing. In addition, TMCF and SAMB can be used as an extensible plug-in for exploring new feature combinations in the visual image domain. Extensive experiments were conducted on the MSVD-QA and MSRVTT-QA datasets. The results confirm the advantages of our approach in handling multimodal tasks. We also provide ablation studies to verify the effectiveness of each proposed component.
Source: Computers, Materials & Continua (SCIE, EI), 2022, No. 12, pp. 6407-6424 (18 pages)
Funding: This work was supported by the National Natural Science Foundation of China (No. 61872231), the National Key Research and Development Program of China (No. 2021YFC2801000), and the Major Research Plan of the National Social Science Foundation of China (No. 20&ZD130).
