Speaker-Aware Cross Attention Speaker Extraction Network (说话人感知的交叉注意力说话人提取网络)

Abstract: Target speaker extraction aims to extract the speech of a specific speaker from a mixed audio signal; the task setup typically provides an enrollment utterance of the target speaker as auxiliary information. Existing approaches have two main limitations: (1) the auxiliary speaker recognition network fails to capture the key information in the enrollment audio; (2) there is no interactive learning mechanism between the mixed-audio embedding and the enrollment-audio embedding. As a result, existing methods suffer from speaker confusion when the enrollment audio differs substantially from the target audio. To address this problem, a Speaker-aware Cross Attention Speaker Extraction Network (SACAN) is proposed. SACAN introduces an attention-based speaker aggregation module into the auxiliary speaker recognition network; the module aggregates the key information about the target speaker's voice characteristics and uses the mixed audio to enhance the target speaker embedding. Furthermore, SACAN builds an interactive learning mechanism through cross attention that promotes the fusion of the speaker embedding with the mixed-audio embedding, strengthening the model's speaker awareness. Experimental results show that SACAN improves STOI by 0.0133 and SI-SDRi by 1.0695 dB over the baseline method, and speaker-confusion evaluations and ablation experiments verify the effectiveness of the individual modules.
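The abstract reports its gain in SI-SDRi (scale-invariant signal-to-distortion ratio improvement). As a reading aid, the following is a minimal sketch of the commonly used definition of that metric, not the authors' evaluation code. The function names `si_sdr` and `si_sdr_improvement`, the small `eps` stabilizer, and the use of plain Python lists are assumptions made for illustration.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length 1-D signals.

    Assumption: the common zero-mean, projection-based definition.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to get the optimally scaled target.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled_target = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled_target)]
    signal_energy = sum(s * s for s in scaled_target)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the extracted signal minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


if __name__ == "__main__":
    # Hypothetical toy signals: a target tone plus an interfering tone.
    target = [math.sin(0.1 * i) for i in range(200)]
    interferer = [math.cos(0.37 * i) for i in range(200)]
    mixture = [t + v for t, v in zip(target, interferer)]
    # Pretend an extractor suppressed the interferer to 10% of its amplitude.
    estimate = [t + 0.1 * v for t, v in zip(target, interferer)]
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the extracted signal leaves the score essentially unchanged. This is why the metric suits extraction models whose output gain is arbitrary.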

Authors: Li Zhuo-zhang (李卓璋); Xu Bo-yan (许柏炎); Cai Rui-chu (蔡瑞初); Hao Zhi-feng (郝志峰) (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China; College of Science, Shantou University, Shantou 515063, China)

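Affiliations: School of Computer Science and Technology, Guangdong University of Technology; College of Science, Shantou University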

Source: Journal of Guangdong University of Technology (《广东工业大学学报》), CAS, 2024, No. 3, pp. 91-101 (11 pages)

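Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0111501); National Science Fund for Excellent Young Scholars (62122022); National Natural Science Foundation of China (61876043, 61976052, 62206064).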

Keywords: speech separation; target speaker extraction; speaker embedding; cross attention; multi-task learning

Classification number: TP391.2 [Automation and Computer Technology - Computer Application Technology]

