Preliminary Evaluation of Convolutional Neural Network Acoustic Model for Iban Language Using NVIDIA NeMo

Michael, Steve Olsen; Juan, Sarah Samson; Mit, Edwin

doi:10.26636/jtit.2022.156121

Article - details

Authors

Michael Steve Olsen, Juan Sarah Samson, Mit Edwin

Abstract

Over the past few years, artificial neural networks (ANNs) have become one of the most common approaches to building acoustic models for automatic speech recognition (ASR). ANNs come in several variants, such as deep neural networks (DNNs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs). CNNs are widely used to improve image processing performance and, in recent years, have also been applied to ASR. This paper reports preliminary results for an end-to-end CNN-based ASR system built with NVIDIA NeMo on a corpus of Iban, an under-resourced language. Studies have shown that CNN acoustic models can achieve excellent word error rates (WERs) on speech data. Results for under-resourced languages, however, remain unsatisfactory. Hence, this paper evaluates the viability and potential of this alternative approach using NVIDIA NeMo, a new ASR engine developed by NVIDIA. Two experiments were conducted: one varied the amount of resources used to train the ASR system, and the other varied an internal parameter of the engine, namely the number of epochs. The results of these experiments are then analyzed and compared with results reported in existing papers.
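
The abstract evaluates its models by word error rate (WER). As a reminder of what that metric measures, WER is conventionally defined as (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, and N is the number of words in the reference. The sketch below is a minimal illustration of that definition, not the paper's or NVIDIA NeMo's implementation; the function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration only; it assumes whitespace
    tokenization and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("the cat sat down", "the dog sat"))
```

Because insertions count as errors while N is fixed by the reference, WER can exceed 1.0 when the output is much longer than the reference.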

Keywords

acoustic modeling; automatic speech recognition; convolutional neural network; CNN; under-resourced language; NVIDIA NeMo

Publisher

Instytut Łączności - Państwowy Instytut Badawczy

Journal

Journal of Telecommunications and Information Technology

Year

2022

Volume

no. 1

Pages

43-53

Physical description

Bibliography: 22 items, figures, tables.

Contributors

author

Michael Steve Olsen

steveolsen.ms@gmail.com

Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia

https://orcid.org/0000-0002-2510-5736

author

Juan Sarah Samson

sjsflora@unimas.my

Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia

https://orcid.org/0000-0002-9590-1666

author

Mit Edwin

edwin@unimas.my

Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia

https://orcid.org/0000-0002-1670-8586

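Notes

Record prepared with funds from the Polish Ministry of Science and Higher Education (MNiSW), agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) - module: Popularisation of science and promotion of sport (2024).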

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-895415ac-ac9d-412d-a072-a91d8d9a3dc0