Article title

Short Utterance Speaker Recognition Based on Speech High Frequency Information Compensation and Dynamic Feature Enhancement Methods

Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
This work addresses two weaknesses of existing short-duration speaker recognition: feature sparsity and insufficiently discriminative acoustic features. To this end, we propose the Bark-scaled Gauss and linear filter bank superposition cepstral coefficients (BGLCC) and a multidimensional central difference (MDCD) acoustic feature extraction method. The Bark-scaled Gauss filter bank focuses on low-frequency information, while the linear filter bank is uniformly distributed across the spectrum; their superposition therefore yields richer and more discriminative acoustic features from short-duration audio signals. In addition, the multidimensional central difference method captures the dynamic characteristics of speakers more effectively, improving the performance of short-utterance speaker verification. Extensive experiments are conducted on short-duration text-independent speaker verification datasets generated from the VoxCeleb, SITW, and NIST SRE corpora, which contain speech samples of diverse lengths and from different scenarios. The results demonstrate that the proposed method outperforms existing acoustic feature extraction approaches by at least 10% on the test set. Ablation experiments further show that the proposed approaches achieve substantial improvements over prior methods.
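Since the abstract describes BGLCC and MDCD only at a high level, the following is a minimal illustrative sketch in Python rather than the authors' exact algorithm. It superposes a Gaussian filter bank spaced uniformly on the Bark scale (emphasising low frequencies) with a triangular filter bank spaced uniformly in Hz, log-compresses the filter energies and decorrelates them with a DCT, and derives dynamic features with a plain first-order central difference over time. The Bark formula variant, the Gaussian width `sigma`, the mixing weight `alpha`, and the triangular shape of the linear bank are all assumptions; the paper's MDCD presumably generalises the simple central difference shown here.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_bark(f):
    # One common Bark-scale approximation (an assumption; the paper's
    # exact variant is not given in the abstract).
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def gauss_bark_fbank(n_filters, n_fft, sr, sigma=0.6):
    # Gaussian filters spaced uniformly on the Bark scale:
    # dense at low frequencies, sparse at high frequencies.
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    barks = hz_to_bark(freqs)
    centers = np.linspace(barks[0], barks[-1], n_filters)
    return np.exp(-0.5 * ((barks[None, :] - centers[:, None]) / sigma) ** 2)

def linear_fbank(n_filters, n_fft, sr):
    # Triangular filters spaced uniformly in Hz: uniform resolution
    # over the whole band, including high frequencies.
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    edges = np.linspace(0, sr / 2, n_filters + 2)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                   (hi - freqs) / (hi - mid)), 0, None)
    return fb

def bglcc(power_spec, sr, n_filters=26, n_ceps=13, alpha=0.5):
    # Superpose the two banks (alpha is an assumed mixing weight),
    # then log-compress and decorrelate with a DCT, as for MFCCs.
    n_fft = 2 * (power_spec.shape[1] - 1)
    fb = alpha * gauss_bark_fbank(n_filters, n_fft, sr) \
         + (1 - alpha) * linear_fbank(n_filters, n_fft, sr)
    energies = np.log(power_spec @ fb.T + 1e-10)
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

def central_difference(feats, width=1):
    # First-order central difference over time, a stand-in for the
    # paper's multidimensional central difference (MDCD) dynamics.
    padded = np.pad(feats, ((width, width), (0, 0)), mode='edge')
    return (padded[2 * width:] - padded[:-2 * width]) / (2.0 * width)
```

Given a (frames x bins) power spectrogram `P` at sampling rate `sr`, one would compute `feats = bglcc(P, sr)` and `dyn = central_difference(feats)`, then concatenate the two along the feature axis to form the front end of a speaker verification system.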
Year
Pages
37-48
Physical description
Bibliography: 31 items, figures, tables, charts.
Authors
author
  • School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
  • School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
Bibliography
  • 1. Atal B.S. (1974), Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, The Journal of the Acoustical Society of America, 55(6): 1304-1312, doi: 10.1121/1.1914702.
  • 2. Bai Z., Zhang X.-L., Chen J. (2020), Speaker verification by partial AUC optimization with Mahalanobis distance metric learning, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28: 1533-1548, doi: 10.1109/TASLP.2020.2990275.
  • 3. Campbell J.P. (1997), Speaker recognition: A tutorial, Proceedings of the IEEE, 85(9): 1437-1462, doi: 10.1109/5.628714.
  • 4. Chowdhury A., Ross A. (2020), Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals, IEEE Transactions on Information Forensics and Security, 15: 1616-1629, doi: 10.1109/TIFS.2019.2941773.
  • 5. Chung J.S., Nagrani A., Zisserman A. (2018), VoxCeleb2: Deep speaker recognition, [in:] Proceedings of Interspeech 2018, pp. 1086-1090, doi: 10.21437/Interspeech.2018-1929.
  • 6. Das R.K., Mahadeva Prasanna S.R. (2016), Exploring different attributes of source information for speaker verification with limited test data, The Journal of the Acoustical Society of America, 140(1): 184, doi: 10.1121/1.4954653.
  • 7. Dehak N., Dehak R., Glass J., Reynolds D., Kenny P. (2010), Cosine similarity scoring without score normalization techniques, [in:] Proceedings of Odyssey 2010 - The Speaker and Language Recognition Workshop.
  • 8. Desplanques B., Thienpondt J., Demuynck K. (2020), ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification, [in:] Proceedings of Interspeech 2020, pp. 3830-3834, doi: 10.21437/Interspeech.2020-2650.
  • 9. Greenberg C.S. et al. (2013), The 2012 NIST speaker recognition evaluation, [in:] Proceedings of Interspeech 2013, pp. 1971-1975, doi: 10.21437/Interspeech.2013-469.
  • 10. Herrera-Camacho A., Zúñiga-Sainos A., Sierra-Martínez G., Tramgol-Curipe J., Mota-Montoya M., Jarquín-Casas A. (2019), Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE, [in:] Proceedings of International Conference on Video, Signal and Image Processing 2019, pp. 105-110, doi: 10.1145/3369318.3369330.
  • 11. Huang L., Pun C.-M. (2020), Audio replay spoof attack detection by joint segment-based linear filter bank feature extraction and attention-enhanced DenseNet-BiLSTM network, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28: 1813-1825, doi: 10.1109/TASLP.2020.2998870.
  • 12. Kenny P., Boulianne G., Dumouchel P. (2005), Eigenvoice modeling with sparse training data, IEEE Transactions on Speech and Audio Processing, 13(3): 345-354, doi: 10.1109/TSA.2004.840940.
  • 13. Kinnunen T., Li H. (2010), An overview of text-independent speaker recognition: From features to supervectors, Speech Communication, 52(1): 12-40, doi: 10.1016/j.specom.2009.08.009.
  • 14. Liu Z., Wu Z., Li T., Li J., Shen C. (2018), GMM and CNN hybrid method for short utterance speaker recognition, IEEE Transactions on Industrial Informatics, 14(7): 3244-3252, doi: 10.1109/TII.2018.2799928.
  • 15. Martin A.F., Greenberg C.S. (2010), The NIST 2010 speaker recognition evaluation, [in:] Proceedings of Interspeech 2010, pp. 2726-2729, doi: 10.21437/Interspeech.2010-722.
  • 16. McLaren M., Ferrer L., Castan D., Lawson A. (2016), The speakers in the wild (SITW) speaker recognition database, [in:] Proceedings of Interspeech 2016, pp. 818-822, doi: 10.21437/Interspeech.2016-1129.
  • 17. Nagrani A., Chung J.S., Zisserman A. (2017), VoxCeleb: A large-scale speaker identification dataset, [in:] Proceedings of Interspeech 2017, pp. 2616-2620, doi: 10.21437/Interspeech.2017-950.
  • 18. Nosratighods M., Ambikairajah E., Epps J., Carey M.J. (2010), A segment selection technique for speaker verification, Speech Communication, 52(9): 753-761, doi: 10.1016/j.specom.2010.04.007.
  • 19. Omar M.K., Pelecanos J.W. (2010), Training universal background models for speaker recognition, [in:] Proceedings of Odyssey 2010 - The Speaker and Language Recognition Workshop, pp. 52-57.
  • 20. Paseddula C., Gangashetty S.V. (2018), DNN based acoustic scene classification using score fusion of MFCC and inverse MFCC, [in:] Proceedings of International Conference on Industrial and Information Systems 2018, pp. 18-21, doi: 10.1109/ICIINFS.2018.8721379.
  • 21. Paszke A. et al. (2017), Automatic differentiation in PyTorch, [in:] Proceedings of NIPS 2017 Workshop, pp. 1-4.
  • 22. Povey D. et al. (2018), Semi-orthogonal low-rank matrix factorization for deep neural networks, [in:] Proceedings of Interspeech 2018, pp. 3743-3747, doi: 10.21437/Interspeech.2018-1417.
  • 23. Schroff F., Kalenichenko D., Philbin J. (2015), FaceNet: A unified embedding for face recognition and clustering, [in:] Proceedings of IEEE Conference on Computer Vision and Pattern Recognition 2015, pp. 815-823, doi: 10.1109/CVPR.2015.7298682.
  • 24. Snyder D., Garcia-Romero D., Sell G., McCree A., Povey D., Khudanpur S. (2019), Speaker recognition for multi-speaker conversations using x-vectors, [in:] Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing 2019, pp. 5796-5800, doi: 10.1109/ICASSP.2019.8683760.
  • 25. Todisco M., Delgado H., Evans N. (2017), Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification, Computer Speech & Language, 45: 516-535, doi: 10.1016/j.csl.2017.01.001.
  • 26. Villalba J. et al. (2020), State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and speakers in the wild evaluations, Computer Speech & Language, 60: 101026, doi: 10.1016/j.csl.2019.101026.
  • 27. Vogt R., Sridharan S., Mason M. (2010), Making confident speaker verification decisions with minimal speech, IEEE Transactions on Audio, Speech, and Language Processing, 18(6): 1182-1192, doi: 10.1109/TASL.2009.2031505.
  • 28. Wu Z., Yu Z., Yuan J., Zhang J. (2016), A twice face recognition algorithm, Soft Computing - A Fusion of Foundations, Methodologies and Applications, 20(3): 1007-1019, doi: 10.1007/s00500-014-1561-9.
  • 29. Yang H., Deng Y., Zhao H.-A. (2019), A comparison of MFCC and LPCC with deep learning for speaker recognition, [in:] Proceedings of International Conference on Big Data and Computing 2019, pp. 160-164, doi: 10.1145/3335484.3335528.
  • 30. Zhang C., Koishida K., Hansen J.H.L. (2018), Text-independent speaker verification based on triplet convolutional neural network embeddings, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(9): 1633-1644, doi: 10.1109/TASLP.2018.2831456.
  • 31. Zinchenko K., Wu C.-Y., Song K.-T. (2017), A study on speech recognition control for a surgical robot, IEEE Transactions on Industrial Informatics, 13(2): 607-615, doi: 10.1109/TII.2016.2625818.
Notes
Record developed with funds from the Ministry of Science and Higher Education (MNiSW), agreement no. SONP/SP/546092/2022, under the "Social Responsibility of Science" programme, module: Popularisation of science and promotion of sport (2024).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-20da0485-3477-41ff-aaf8-a455f3e033da