Abstract
Self-supervised voiceprint recognition based on AudioMAE generalizes well and does not require large amounts of annotated data. However, when reconstructing the original Mel spectrogram, AudioMAE uses only the output of the last encoder layer and ignores the feature information contained in the shallower encoder layers. To address this, this paper proposes a multi-level feature fusion strategy. First, shallow-layer features are aligned with the last-layer features through a projection layer. Then, a dynamic weighting strategy fuses the features from the different levels. Finally, the fused features are passed to the decoder for reconstruction. Experimental results show that the proposed method achieves a top-1 classification accuracy of 95.95% and a top-5 classification accuracy of 98.44%, improvements of 0.68% and 0.24%, respectively, over the original AudioMAE.
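The three steps named in the abstract (projection for alignment, dynamic weighting, fusion before the decoder) can be illustrated with a minimal sketch. The abstract does not specify how the dynamic weights are computed, so the softmax over one scalar logit per level used here is an assumption, as are all names (`project`, `fuse_levels`), the layer count, and the feature sizes. This is not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Turn per-level logits into positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(tokens, weight):
    """Linear projection: tokens [T][D_in], weight [D_out][D_in] -> [T][D_out]."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight]
            for tok in tokens]


def fuse_levels(level_feats, proj_weights, logits):
    """Fuse encoder features from several levels.

    level_feats: per-level token features; the last entry is the final
        encoder layer and is left unprojected.
    proj_weights: one projection matrix per shallow level.
    logits: one scalar per level (assumed form of the dynamic weights).
    """
    aligned = [project(f, w) for f, w in zip(level_feats[:-1], proj_weights)]
    aligned.append(level_feats[-1])
    weights = softmax(logits)
    num_tokens, dim = len(aligned[0]), len(aligned[0][0])
    fused = [[sum(wt * feat[t][d] for wt, feat in zip(weights, aligned))
              for d in range(dim)]
             for t in range(num_tokens)]
    return fused, weights


# Hypothetical sizes: 4 tokens, width 8, two shallow levels plus the last layer.
random.seed(0)
T, D = 4, 8


def rand_matrix(rows, cols, scale=1.0):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]


shallow, middle, last = rand_matrix(T, D), rand_matrix(T, D), rand_matrix(T, D)
proj_weights = [rand_matrix(D, D, 0.1), rand_matrix(D, D, 0.1)]
logits = [0.0, 0.0, 1.0]  # in training these would be learned parameters

fused, weights = fuse_levels([shallow, middle, last], proj_weights, logits)
# `fused` has the same [T][D] shape as the last-layer output, so a decoder
# that expects last-layer features could consume it unchanged.
```

Normalizing the weights with a softmax keeps the fused features on the same scale as any single level, which is one plausible reason to prefer it over unconstrained weights.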
Authors
LIN Zewen (林泽文)
ZHENG Jingyuan (郑景元)
HE Yundong (何允栋)
YU Wenjing (余文敬)
XU Chong (徐翀)
Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
Source
Journal of Fujian Computer (《福建电脑》)
2024, No. 10, pp. 23-27 (5 pages)
Funding
Supported by the National Undergraduate Innovation Training Program (No. 202310336046).
Keywords
Speaker Verification
Self-Supervised Learning
Masked Autoencoder
Multi-Level Feature Fusion