Identifiers
Title variants
Publication languages
Abstracts
The study explored the performance of vowel recognition using an acoustic model built on Audio Fingerprint techniques [1]. The research compares the performance of Support Vector Machine (SVM), Hidden Markov Model (HMM), Artificial Neural Network (ANN) and k-Nearest Neighbours (k-NN) classifiers in recognising vowels both in isolation and within words, and investigates the importance of different types of acoustic speech features in this process. Temporal, spectral, cepstral, formant, LPC and perceptual features of speech were examined. Feature importance was assessed with a random forest classifier, and vowel classification was tested at three confidence levels for feature importance: 90%, 95% and 99%. Two databases compiled by the authors, comprising a total of 1,200 samples from 20 speakers recorded under household conditions, were used. The classifiers were evaluated by confusion matrix, accuracy, precision, sensitivity and F1 score. Words were segmented into speech sounds using a tool based on BiLSTM recurrent neural networks and the BIC criterion. The three most important features were power spectral density, spectral cut-off, and Power-Normalised Cepstral Coefficients (PNCC). In the isolated-vowel recognition task, the SVM classifier was the most effective at the 95% feature-importance confidence level, obtaining accuracy = 81%, precision = 81%, sensitivity = 81% and F1 score = 80%. In the within-word task, it was verified whether the algorithm detected the presence of a vowel in the correct segment and whether it recognised the correct vowel within it. The best results were obtained by the k-NN classifier (feature-importance confidence level of 99.9%), but they were low; correct recognition of the vowel in the word was 20% for A, E and U, 7% for I and O, and 23% for Y. This indicates a strong influence of neighbouring speech sounds on the acoustic model of vowels and their recognition.
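The feature families named above can be illustrated with a short sketch. Below is a minimal Python example of extracting three of them (power spectral density, spectral roll-off as a proxy for the "spectral cut-off" feature, and cepstral coefficients) using librosa [12] and SciPy [13], both cited by the paper. The file name, frame parameters and the use of MFCCs in place of the paper's PNCC features [16] are assumptions of this sketch, not the authors' exact pipeline.

```python
import librosa                      # audio analysis, cited as ref. [12]
import numpy as np
from scipy.signal import welch      # SciPy, cited as ref. [13]

# Hypothetical input: a mono recording of one isolated vowel sample.
signal, sr = librosa.load("vowel_a.wav", sr=None, mono=True)

# Power spectral density estimated with Welch's method.
freqs, psd = welch(signal, fs=sr, nperseg=1024)

# Spectral roll-off, used here as a proxy for the "spectral cut-off"
# feature named in the abstract; roll_percent is an assumed parameter.
rolloff = librosa.feature.spectral_rolloff(y=signal, sr=sr, roll_percent=0.85)

# Cepstral coefficients: librosa ships MFCCs; the paper's PNCC
# features [16] would require a separate implementation.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

# Collapse frame-level features into one fixed-length vector per sample.
features = np.concatenate([[psd.mean()], [rolloff.mean()], mfcc.mean(axis=1)])
```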
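Likewise, the classification protocol described in the abstract (random-forest feature ranking followed by classifier training and evaluation with confusion matrix, accuracy, precision, sensitivity and F1 score) maps naturally onto scikit-learn [25]. The sketch below substitutes a mean-importance threshold for the paper's statistical confidence-level filter (90%, 95%, 99%) and random data for the authors' 1,200-sample databases; both substitutions are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Placeholder data standing in for the authors' feature matrix:
# 1,200 samples (as in the paper), 40 hypothetical acoustic features,
# six vowel classes (A, E, I, O, U, Y).
rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 40))
y = rng.integers(0, 6, size=1200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 1: rank features with a random forest, as the abstract describes.
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_tr, y_tr)

# Step 2: keep the more important features. The mean-importance cut-off
# is an assumption; the paper filters by confidence level instead.
keep = forest.feature_importances_ >= forest.feature_importances_.mean()

# Step 3: train an SVM on the reduced set and report the same metrics
# the study uses (sensitivity corresponds to recall).
svm = SVC(kernel="rbf").fit(X_tr[:, keep], y_tr)
y_pred = svm.predict(X_te[:, keep])

print(confusion_matrix(y_te, y_pred))
print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred, average="macro"))
print("recall   :", recall_score(y_te, y_pred, average="macro"))
print("F1       :", f1_score(y_te, y_pred, average="macro"))
```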
Keywords
Journal
Year
Volume
Pages
art. no. 2024101
Physical description
Bibliography: 36 items, 1 figure, 1 chart
Authors
author
- Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warszawa
Bibliography
- 1. P. Cano, E. Batlle, T. Kalker, J. Haitsma; A review of audio fingerprinting; In: 2002 IEEE Workshop on Multimedia Signal Processing, 2002, 169–173; DOI: 10.1109/MMSP.2002.1203274
- 2. M.J.F. Gales, S. Young; The application of hidden Markov models in speech recognition; Foundations and Trends in Signal Processing, 2007, 1(3), 195–304; DOI: 10.1561/2000000004
- 3. I. Steinwart, A. Christmann; Support vector machines; Wiley Interdisciplinary Reviews: Computational Statistics, 2008
- 4. J.M. Tebelskis; Speech recognition using neural networks; Carnegie Mellon University, 1995
- 5. L. Golipour, D. O’Shaughnessy; Context-independent phoneme recognition using a k-nearest neighbour classification approach; In: 2009 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, IEEE, 2009, 1341–1344; DOI: 10.1109/ICASSP.2009.4959840
- 6. J. Saini, R. Mehra; Power spectral density analysis of speech signal using window techniques; International Journal of Computer Applications, 2015, 131(14), 33–36
- 7. L.R. Rabiner, M.R. Sambur; An algorithm for determining the endpoints of isolated utterances; Bell System Technical Journal, 1975, 54(2), 297–315; DOI: 10.1002/j.1538-7305.1975.tb02840.x
- 8. M. Kos, Z. Kačič, D. Vlaj; Speech bandwidth classification using general acoustic features, modified spectral roll-off and artificial neural network; In: 14th Int. Conf. on Mathematical Models and Methods in Modern Science, 2012, 212–217
- 9. M. Kos, Z. Kačič, D. Vlaj; Acoustic classification and segmentation using modified spectral roll-off and variance-based features; Digital Signal Processing, 2013, 23(2), 659–674
- 10. R.A. Scholtz; How do you define bandwidth?; International Telemetering Conference Proceedings, 1972, 8
- 11. P. Tsiakoulis, A. Potamianos, D. Dimitriadis; Spectral moment features augmented by low order cepstral coefficients for robust ASR; IEEE Signal Processing Letters, 2010, 17(6), 551–554; DOI: 10.1109/LSP.2010.2046349
- 12. B. McFee, C. Raffel, D. Liang, D.P. Ellis, M. McVicar, E. Battenberg, O. Nieto; librosa: Audio and music signal analysis in Python; In: Proc. of the 14th Python in Science Conf., 2015, 8, 18–25
- 13. P. Virtanen et al.; SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python; Nature Methods, 2020, 17, 261–272; DOI: 10.1038/s41592-019-0686-2
- 14. A. Gray, J. Markel; A spectral-flatness measure for studying the autocorrelation method of linear prediction of speech analysis; IEEE Transactions on Acoustics, Speech, and Signal Processing, 1974, 22(3), 207–217; DOI: 10.1109/TASSP.1974.1162572
- 15. J.J. Noda, C.M. Travieso-González, D. Sánchez-Rodríguez, J.B. Alonso-Hernández; Acoustic classification of singing insects based on MFCC/LFCC fusion; Applied Sciences, 2019, 9(19), 4097; DOI: 10.3390/app9194097
- 16. C. Kim, R.M. Stern; Power-normalized cepstral coefficients (PNCC) for robust speech recognition; IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(7), 1315–1329; DOI: 10.1109/TASLP.2016.2545928
- 17. C.K. On, P.M. Pandiyan, S. Yaacob, A. Saudi; Mel-frequency cepstral coefficient analysis in speech recognition; In: 2006 Int. Conf. on Computing & Informatics, IEEE, 2006, 1–5; DOI: 10.1109/ICOCI.2006.5276486
- 18. U. Shrawankar, V.M. Thakare; Techniques for feature extraction in speech recognition system: A comparative study; arXiv (Cornell University), 2013; DOI: 10.48550/arXiv.1305.1145
- 19. H. Hermansky; Perceptual linear predictive (PLP) analysis of speech; The Journal of the Acoustical Society of America, 1990, 87(4), 1738–1752; DOI: 10.1121/1.399423
- 20. N. Kraus, T. Nicol; Brainstem origins for cortical ‘what’ and ‘where’ pathways in the auditory system; Trends in Neurosciences, 2005, 28(4), 176–181; DOI: 10.1016/j.tins.2005.02.003
- 21. Audacity® software is copyright © 1999–2021 Audacity Team; the name Audacity® is a registered trademark; https://audacityteam.org/ (accessed Jun. 2023)
- 22. A toolkit to implement speech segmentation based on BIC and neural networks such as BiLSTM; https://github.com/wblgers/py_speech_seg (accessed 20.06.2023)
- 23. S. Chen, P. Gopalakrishnan et al.; Speaker, environment and channel change detection and clustering via the Bayesian information criterion; In: Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne Conference Resort, Lansdowne, 1998, 8, 127–132
- 24. R. Yin, H. Bredin, C. Barras; Speaker change detection in broadcast TV using bidirectional long short-term memory networks; In: Interspeech 2017, ISCA, 2017; DOI: 10.21437/Interspeech.2017-65
- 25. F. Pedregosa et al.; Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 2011, 12, 2825–2830
- 26. M.S. Sonwane, C.A. Dhawale; Evaluation and analysis of few parametric and nonparametric classification methods; In: 2016 Second Int. Conf. on Computational Intelligence & Communication Technology (CICT), Ghaziabad, India, 2016, 14–21; DOI: 10.1109/CICT.2016.13
- 27. C. Zhang, C. Liu, X. Zhang, G. Almpanidis; An up-to-date comparison of state-of-the-art classification algorithms; Expert Systems with Applications, 2017, 128–150; DOI: 10.1016/j.eswa.2017.04.003
- 28. J. Bilmes; Gaussian models in automatic speech recognition; In: D. Havelock, S. Kuwano, M. Vorländer, Eds., Handbook of Signal Processing in Acoustics, Springer, New York, 2008, 521–555; DOI: 10.1007/978-0-387-30441-0_29
- 29. Y.R. Kumar, A.V. Babu, K.N. Kumar, J.S.R. Alex; Modified Viterbi decoder for HMM based speech recognition system; In: 2014 Int. Conf. on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 2014, 470–474
- 30. V.M. Ilic; Entropy semiring forward-backward algorithm for HMM entropy computation; arXiv (Cornell University), 2011; DOI: 10.48550/arXiv.1108.0347
- 31. H. Lu, Y.J. Wu, K. Tokuda, L.R. Dai, R.H. Wang; Full covariance state duration modeling for HMM-based speech synthesis; In: 2009 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, IEEE, 2009, 4033–4036; DOI: 10.1109/ICASSP.2009.4960513
- 32. A. Patle, D.S. Chouhan; SVM kernel functions for classification; In: 2013 Int. Conf. on Advances in Technology and Engineering (ICATE), IEEE, 2013, 1–9
- 33. A. Ahad, A. Fayyaz, T. Mehmood; Speech recognition using multilayer perceptron; In: IEEE Students Conf., ISCON ’02 Proc., IEEE, 2002, 1, 103–109; DOI: 10.1109/ISCON.2002.1215948
- 34. S. Sharma, S. Sharma, A. Athaiya; Activation functions in neural networks; Int. Journal of Engineering Applied Sciences and Technology, 2020, 4(12), 310–316
- 35. X. Wu, R. Ward, L. Bottou; WNGrad: Learn the learning rate in gradient descent; arXiv (Cornell University), 2018; DOI: 10.48550/arXiv.1803.02865
- 36. M. Mohibullah, M.Z. Hossain, M. Hasan; Comparison of Euclidean distance function and Manhattan distance function using k-medoids; Int. Journal of Computer Science and Information Security (IJCSIS), 2015, 13(10), 61–71
Notes
Record created with funds of the Ministry of Science and Higher Education, agreement no. POPUL/SP/0154/2024/02, under the "Społeczna odpowiedzialność nauki II" (Social Responsibility of Science II) programme - module: Popularisation of Science (2025).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-474852fe-b79b-4784-aff4-701d340836d0