Article title

Lossy coding impact on speech recognition with convolutional neural networks

Languages of publication
EN
Abstracts
EN
This paper presents research on the impact of lossy coding on speech recognition with convolutional neural networks. For this purpose, the Google Speech Commands dataset, containing utterances of 30 words, was encoded with four of the most common general-purpose codecs: MP3, AAC, WMA and Ogg. A convolutional neural network was trained on part of the original files and then tested on the remaining files, as well as on their counterparts encoded with the different codecs and bitrates. The same network model was also trained on MP3-encoded data, the codec that caused the largest drop in the first network's accuracy. The results show that lossy coding does affect speech recognition, especially at low bitrates.
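The recognition pipeline described in the abstract follows the spectrogram-plus-CNN approach of the TensorFlow keyword-recognition tutorial cited in the bibliography. As a minimal sketch of the input stage only (NumPy instead of TensorFlow; the frame length and hop size are assumed values borrowed from that tutorial, not settings confirmed by the paper), the spectrogram for a one-second Speech Commands clip can be computed as:

```python
import numpy as np

def spectrogram(waveform, frame_len=255, step=128):
    """Magnitude spectrogram via a sliding-window FFT with a Hann window.

    frame_len and step mirror the TensorFlow "simple audio recognition"
    tutorial; treat them as assumed, not confirmed, preprocessing settings.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // step
    frames = np.stack([waveform[i * step : i * step + frame_len] * window
                       for i in range(n_frames)])
    # rfft of a length-255 frame yields 128 frequency bins per frame.
    return np.abs(np.fft.rfft(frames, axis=-1))

# One second of audio at 16 kHz, as in the Speech Commands dataset.
wave = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
spec = spectrogram(wave)
print(spec.shape)  # (124, 128)
```

The resulting 124x128 array is the image-like input a 2-D convolutional network would be trained on, both for the original files and for their decoded lossy-coded counterparts.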
Year
Pages
art. no. 2022302
Physical description
Bibliography: 12 items; colour illustrations.
Authors
  • Wrocław University of Science and Technology, Department of Computer Engineering, Janiszewskiego 11/17, 50-372 Wrocław
Bibliography
  • 1. U. Kamath, J. Liu, J. Whitaker; Deep Learning for NLP and Speech Recognition; Springer Nature Switzerland AG 2019. DOI: 10.1007/978-3-030-14596-5
  • 2. R.V. Pawar, P.P. Kajave, S.N. Mali; Speaker Identification using Neural Networks; Proceedings of World Academy of Science, Engineering and Technology, Volume 7, August 2005
  • 3. V. Delić, Z. Perić, M. Secujski, N. Jakovljević, J. Nikolić, D. Misković, N. Simić, S. Suzić, T. Delić; Speech Technology Progress Based on New Machine Learning Paradigm; Hindawi Computational Intelligence and Neuroscience Volume 2019. DOI: 10.1155/2019/4368036
  • 4. O. Such, S. Barreda, A. Mojsej; A comparison of formant and CNN models for vowel frame recognition; 2019
  • 5. H.A. Patil, A.E. Cohen, K.K. Parhi; Speaker Identification over Narrowband VoIP Networks; Forensic Speaker Recognition, Springer: New York, 2012. DOI: 10.1007/978-1-4614-0263-3_6
  • 6. M. Kucharski, S. Brachmański; Coding Effects on Changes in Formant Frequencies in Japanese Speech Signals; Vibrations in Physical Systems, 2019, 1, 30, 243-250
  • 7. M. Kucharski, S. Brachmański; Coding effects on changes in formant frequencies in Japanese and English speech signals; EURASIP Journal on Audio, Speech and Music Processing, 2022, submitted
  • 8. P. Warden; Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition; arXiv:1804.03209, 2018. DOI: 10.48550/arXiv.1804.03209
  • 9. Simple audio recognition: Recognizing keywords; https://github.com/tensorflow/docs/blob/master/site/en/tutorials/audio/simple_audio.ipynb (accessed 28.04.2022)
  • 10. A.B. Downey; Think DSP: Digital Signal Processing in Python; Version 1.1.4, Green Tea Press, 2014.
  • 11. TensorFlow Core v2.9.1 API documentation for Python; https://www.tensorflow.org/api_docs/python/tf (accessed 15.08.2022)
  • 12. C. Guo, G. Pleiss, Y. Sun, K.Q. Weinberger; On Calibration of Modern Neural Networks; International Conference on Machine Learning, 2017. DOI: 10.48550/arXiv.1706.04599
Notes
Record created with funding from the Polish Ministry of Education and Science (MEiN), agreement no. SONP/SP/546092/2022, under the "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) programme, module: Popularisation of science and promotion of sport (2022-2023).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-b2e54dbe-cdf6-474d-9b21-38f66abfc3ea