Article title

A system dedicated to Polish automatic speech recognition – overview of solutions

Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
The paper presents an analysis of modern Artificial Intelligence algorithms for an automated system that supports humans during conversations held in Polish. The system's task is to perform Automatic Speech Recognition (ASR) and process the output further, for instance to fill in a computer-based form or to apply Natural Language Processing (NLP) that assigns the conversation to one of the predefined categories. A state-of-the-art review is required to select the optimal set of tools for processing speech under difficult conditions that degrade ASR accuracy. The paper presents the top-level architecture of a system applicable to this task. Characteristics of the Polish language are discussed. Next, existing ASR solutions and architectures with End-to-End (E2E) deep neural network (DNN) based ASR models are presented in detail. Differences between Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN) and Transformers in the context of ASR technology are also discussed.
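The abstract describes a two-stage pipeline: Polish ASR followed by an NLP step that assigns the transcript to one of the predefined categories. A minimal sketch of such a pipeline is shown below; it assumes the openai-whisper package [69] as the ASR back end, and the input file name, category names and keyword lists are hypothetical placeholders rather than components of the system described in the paper.

import whisper  # ASR back end, reference [69]; any Polish-capable model could be substituted

# Hypothetical categories with illustrative Polish keywords (placeholders, not from the paper).
CATEGORIES = {
    "complaint": ["reklamacja", "skarga", "zwrot"],
    "order": ["zamówienie", "kupić", "dostawa"],
    "support": ["pomoc", "problem", "nie działa"],
}

def transcribe_pl(path: str) -> str:
    """Transcribe a Polish recording with a multilingual Whisper checkpoint."""
    model = whisper.load_model("small")
    result = model.transcribe(path, language="pl")
    return result["text"]

def categorize(text: str) -> str:
    """Toy keyword matcher standing in for the NLP categorization step."""
    lowered = text.lower()
    scores = {cat: sum(kw in lowered for kw in kws) for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

if __name__ == "__main__":
    transcript = transcribe_pl("call.wav")  # hypothetical input recording
    print(transcript)
    print("category:", categorize(transcript))

In a real deployment the keyword matcher would be replaced by a trained Polish text classifier, for example one evaluated against the KLEJ benchmark [10].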
Year
Pages
art. no. e149818
Physical description
Bibliography: 86 items, figures, tables
Authors
  • The Faculty of Electronics and Information Technology at Warsaw University of Technology, Nowowiejska 15/19 Av., 00-665 Warsaw, Poland
author
  • The Faculty of Electronics and Information Technology at Warsaw University of Technology, Nowowiejska 15/19 Av., 00-665 Warsaw, Poland
Bibliography
  • [1] J. Meyer, L. Dentel, and F. Meunier, “Speech recognition in natural background noise,” PLOS ONE, vol. 8, no. 11, p. e79279, Nov. 2013, doi: 10.1371/journal.pone.0079279.
  • [2] H.K. Kim and R.C. Rose, “Speech recognition over mobile networks,” in Automatic Speech Recognition on Mobile Devices and over Communication Networks. London: Springer London, 2008, no. 1, pp. 41–61, doi: 10.1007/978-1-84800-143-5_3.
  • [3] R.S. Chavan and G.S. Sable, “An overview of speech recognition using HMM,” Int. J. Comput. Sci. Mob. Comput., vol. 2, no. 6, pp. 233–238, Jun. 2013.
  • [4] S. Furui, “50 years of progress in speech and speaker recognition research,” ECTI Trans. Comput. Inf. Technol., vol. 1, no. 2, pp. 64–74, 2005, doi: 10.37936/ecti-cit.200512.51834.
  • [5] G.E. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012, doi: 10.1109/msp.2012.2205597.
  • [6] S. Wang and G. Li, “Overview of end-to-end speech recognition,” J. Phys.: Conf. Ser., vol. 1187, no. 5, p. 052068, Apr. 2019, doi: 10.1088/1742-6596/1187/5/052068.
  • [7] “Speech-to-Text: Automatic Speech Recognition | Google Cloud,” (2023). [Online]. Available: https://cloud.google.com/speech-to-text (Accessed 2024-03-13).
  • [8] M.J.F. Gales and S. Young, “The application of hidden Markov models in speech recognition,” Found. Trends Signal Process., vol. 1, no. 3, pp. 195–304, Jan. 2007, doi: 10.1561/2000000004.
  • [9] J. Li, “Recent advances in End-to-End automatic speech recognition,” APSIPA Trans. Signal Inf. Proc., vol. 11, no. 1, Jan. 2022, doi: 10.1561/116.00000050.
  • [10] P. Rybak, R. Mroczkowski, J. Tracz, and I. Gawlik, “KLEJ: Comprehensive benchmark for Polish language understanding,” in Proc. 58th Annu. Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 5-10 Jul. 2020, pp. 1191–1201, doi: 10.18653/v1/2020.acl-main.111.
  • [11] A. Magueresse, V. Carles, and E. Heetderks, “Low-resource Languages: A review of past work and future challenges,” arXiv (Cornell University), Jun. 2020, doi: 10.48550/arxiv.2006.07264.
  • [12] E. Trentin and M. Gori, “A survey of hybrid ANN/HMM models for automatic speech recognition,” Neurocomputing, vol. 37, no. 1-4, pp. 91–126, Apr. 2001, doi: 10.1016/s0925-2312(00)00308-8.
  • [13] M.A. Mazumder and R.A. Salam, “Feature Extraction Techniques for Speech Processing: A Review,” Int. J. Adv. Trends. Comput. Sci. Eng., vol. 8, no. 1.3, pp. 285–292, 2019.
  • [14] A. Zygadło, A. Janicki, and P. Dąbek, “Automatic speech recognition system for Polish dedicated for a social robot,” PAR. Pomiary Automatyka Robotyka, vol. 4/2016, pp. 27–36, Dec. 2016, doi: 10.14313/par_222/27.
  • [15] A. Janicki and D. Wawer, “Automatic speech recognition for Polish in a computer game interface,” in 2011 Federated Conf. on Computer Science and Information Systems (FedCSIS). Szczecin, Poland: IEEE, 18-21 Sep. 2011, pp. 711–716. [Online]. Available: https://ieeexplore.ieee.org/document/6078265
  • [16] L. Pawlaczyk and P. Bosky, “Skrybot – a system for automatic speech recognition of Polish language,” in Man-Machine Interactions. Berlin, Heidelberg: Springer, Jan. 2009, vol. 59, pp. 381–387, doi: 10.1007/978-3-642-00563-3_40.
  • [17] J. Jamrozy, M. Lange, M. Owsianny, and M. Szymanski, “ARM-1: Automatic speech recognition engine,” in Proc. PolEval 2019 Workshops. Warsaw, Poland: Institute of Computer Science, Polish Academy of Sciences, 2019, pp. 79–88. [Online]. Available: http://poleval.pl/files/poleval2019.pdf
  • [18] R. Cecko, J. Jamroż, W. Jęśko, E. Kuśmierek, M. Lange, and M. Owsianny, “Automatic Speech Recognition and its Application to Media Monitoring,” Computational Methods in Science and Technology, vol. 27, no. 2, pp. 41–55, Nov. 2021, doi: 10.12921/cmst.2021.0000015.
  • [19] W. Majewski, H.B. Rothman, and H. Hollien, “Acoustic comparisons of American English and Polish,” J. Phon., vol. 5, no. 3, pp. 247–251, Jul. 1977, doi: 10.1016/s0095-4470(19)31138-6.
  • [20] W. Majewski, H. Hollien, and J. Zalewski, “Speaking fundamental frequency of Polish adult males,” Phonetica, vol. 25, no. 2, pp. 119–125, Mar. 1972, doi: 10.1159/000259375.
  • [21] G. Demenko et al., “Development of large vocabulary continuous speech recognition for Polish,” Acta Phys. Pol. A, Jan. 2012, doi: 10.12693/aphyspola.121.a-86.
  • [22] J.C. Wells, “Computer-coding the IPA: a proposed extension of SAMPA,” (1995). [Online]. Available: https://www.phon.ucl.ac.uk/home/sampa/ipasam-x.pdf (Accessed 2024-03-13).
  • [23] R.F. Feldstein, A concise Polish grammar. Slavic and East European Language Research Center (SEELRC), Duke University, 2001.
  • [24] E. Rudnicka, M. Maziarz, M. Piasecki, and S. Szpakowicz, “A strategy of mapping Polish wordnet onto princeton wordnet,” in Proc. 24th COLING 2012: Posters. Mumbai, India: COLING 2012, 8-15 Dec. 2012, pp. 1039–1048.
  • [25] G. Rehm and H. Uszkoreit, The Polish language in the digital age. Berlin, Heidelberg: Springer, Jan. 2012, doi: 10.1007/978-3-642-30811-6.
  • [26] A. Pohl and B. Ziółko, “Using part of speech n-grams for improving automatic speech recognition of Polish,” in Machine Learning and Data Mining in Pattern Recognition. Proc. of 9th Int. Conf., MLDM 2013. New York, NY, USA: Springer Berlin, Heidelberg, 19-25 Jul. 2013, pp. 492–504, doi: 10.1007/978-3-642-39712-7_38.
  • [27] K. Marasek, L. Brocki, D. Korzinek, K. Wołk, and R. Gubrynowicz, “Spoken language translation for Polish,” arXiv (Cornell University), Nov. 2015, doi: 10.48550/arXiv.1511.07788.
  • [28] R. Rosenfeld, “Two decades of statistical language modeling: where do we go from here?” Proc. of the IEEE, vol. 88, no. 8, pp. 1270–1278, Aug. 2000, doi: 10.1109/5.880083.
  • [29] J. Nouza, P. Cerva, and R. Safarik, “Cross-Lingual adaptation of broadcast transcription system to Polish language using public data sources,” in Human Language Technology. Challenges for Computer Science and Linguistics. 7th Language and Technology Conf., LTC 2015. Poznań, Poland: Springer Cham, 27-29 Nov. 2015, pp. 31–41, doi: 10.1007/978-3-319-93782-3_3.
  • [30] B. Ziółko, S. Manandhar, R.C. Wilson, M. Ziółko, and J. Galka, “Application of HTK to the Polish language,” in ICALIP 2008 Int. Conf. on Audio, Language and Image Processing. Shanghai, China: IEEE, 7-9 Jul. 2008, pp. 31–41, doi: 10.1109/ICALIP.2008.4590266.
  • [31] J. Nouza, R. Safarik, and P. Cerva, “ASR for South Slavic Languages Developed in Almost Automated Way.” in INTERSPEECH 2016. San Francisco, USA: ISCA Speech, 8-12 Sep. 2016, pp. 3868–3872, doi: 10.21437/Interspeech.2016-747.
  • [32] B. Ziółko, P. Żelasko, I. Gawlik, T. Pędzimąż, and T. Jadczyk, “An Application for Building a Polish Telephone Speech Corpus,” in LREC 2018, 11th Int. Conf. on Language Resource and Evaluation, N. Calzolari, Ed. Miyazaki, Japan: ELRA, 7-12 May 2018, pp. 429–433.
  • [33] S. Grocholewski, “CORPORA – speech database for Polish diphones,” in Proc. 5th European Conf. on Speech Communication and Technology (Eurospeech 1997). Rhodes, Greece: ISCA Speech, 22-25 Sep. 1997, pp. 1735–1738, doi: 10.21437/Eurospeech.1997-492.
  • [34] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A large-scale multilingual dataset for speech research,” in INTERSPEECH 2020. Shanghai, China: ISCA Speech, 25-29 Oct. 2020, pp. 2757–2761, doi: 10.21437/Interspeech.2020-2826.
  • [35] R. Ardila et al., “Common Voice: a Massively-Multilingual Speech Corpus,” in Proc. of the 12th Conf. on Language Resources and Evaluation (LREC 2020). Marseille, France: ELRA, 11–16 May 2020, pp. 4218–4222. [Online]. Available: https://www.aclweb.org/anthology/2020.lrec-1.520.pdf
  • [36] “Common Voice open source, multi-language dataset of voices,” (2023). [Online]. Available: https://commonvoice.mozilla.org/en/datasets (Accessed 2023-03-23).
  • [37] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). South Brisbane, QLD, Australia: IEEE, 19-24 Apr. 2015, pp. 5206–5210, doi: 10.1109/ICASSP.2015.7178964.
  • [38] J. Gu, Z. Lu, H. Li, and V. O. Li, “Incorporating copying mechanism in sequence-to-sequence learning,” in Proc. 54th Annu. Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, 7-12 Aug. 2016, pp. 1631–1640, doi: 10.18653/v1/P16-1154.
  • [39] O. Vinyals, S.V. Ravuri, and D. Povey, “Revisiting recurrent neural networks for robust ASR,” in 2012 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan: IEEE, 25-30 Mar. 2012, pp. 4085–4088, doi: 10.1109/ICASSP.2012.6288816.
  • [40] D. Jurafsky and J.H. Martin, “Speech and Language Processing (3rd ed. draft),” (2023). [Online]. Available: https://web.stanford.edu/~jurafsky/slp3/ (Accessed 2023-07-30).
  • [41] M.W. Kadous, “Temporal classification: Extending the classification paradigm to multivariate time series,” Ph.D. dissertation, The University of New South Wales, 2002.
  • [42] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, ser. Studies in computational intelligence. Berlin, Heidelberg: Springer, 2012, doi: 10.1007/978-3-642-24797-2.
  • [43] A.N. Shewalkar, “Comparison of RNN, LSTM and GRU on speech recognition data,” Master’s thesis, North Dakota State University, 2018.
  • [44] A. Graves, A.R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Processing. Vancouver, BC, Canada: IEEE, 26-31 May 2013, pp. 6645–6649, doi: 10.1109/ICASSP.2013.6638947.
  • [45] D. Hu, “An introductory survey on attention mechanisms in NLP problems,” in Intelligent Systems and Applications. Proc. of the 2022 Intelligent Systems Conf. (IntelliSys), Y. Bi, R. Bhatia, and S. Kapoor, Eds. Amsterdam, Netherlands: Springer, 1-2 Sep. 2020, pp. 432–448, doi: 10.1007/978-3-030-29513-4_31.
  • [46] X. Zhu, D. Cheng, Z. Zhang, S. Lin, and J. Dai, “An empirical study of spatial attention mechanisms in deep networks,” in Proc. 2019 IEEE/CVF Int. Conf. on Computer Vision (ICCV). Seoul, Korea (South): IEEE Computer Society, 27 Oct.–2 Nov. 2019, pp. 6687–6696, doi: 10.1109/ICCV.2019.00679.
  • [47] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,” arXiv (Cornell University), Nov. 2015, doi: 10.48550/arXiv.1511.08458.
  • [48] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert, “Fully convolutional speech recognition,” arXiv (Cornell University), Dec. 2018, doi: 10.48550/arXiv.1812.06864.
  • [49] J. Li et al., “Jasper: An end-to-end convolutional neural acoustic model,” in INTERSPEECH 2019. Graz, Austria: ISCA Speech, 15–19 Sep. 2019, pp. 71–75, doi: 10.21437/Interspeech.2019-1819. [Online]. Available: https://www.isca-speech.org/archive/interspeech_2019/li19_interspeech.html
  • [50] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, and A. Courville, “Towards end-to-end speech recognition with deep convolutional neural networks,” arXiv (Cornell University), Jan. 2017, doi: 10.48550/arXiv.1701.02720.
  • [51] W. Song and J. Cai, “End-to-end deep neural network for automatic speech recognition,” Stanford CS224D Reports, 2015. [Online]. Available: http://cs224d.stanford.edu/reports/SongWilliam.pdf
  • [52] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv (Cornell University), Jul. 2016, doi: 10.48550/arXiv.1607.08022.
  • [53] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems 30 (NIPS 2017). Proc. of 31st Annu. Conf. on Neural Information Processing Systems, U. Von Luxburg, Ed. Long Beach, CA, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), Dec. 2017, pp. 5999–6010. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need
  • [54] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” in 2021 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE, 6-11 Jun. 2021, pp. 21–25, doi: 10.1109/ICASSP39728.2021.9413901.
  • [55] L. Dong, S. Xu, and B. Xu, “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Calgary, AB, Canada: IEEE, 15–20 Apr. 2018, pp. 5884–5888, doi: 10.1109/ICASSP.2018.8462506.
  • [56] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, “Towards online end-to-end transformer automatic speech recognition,” arXiv (Cornell University), Oct. 2019, doi: 10.48550/arXiv.1910.11871.
  • [57] A. Gulati et al., “Conformer: Convolution-augmented transformer for speech recognition,” in INTERSPEECH 2020. Shanghai, China: ISCA Speech, 25-29 Oct. 2020, pp. 5036–5040, doi: 10.21437/Interspeech.2020-3015.
  • [58] C.C. Chiu et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Calgary, AB, Canada: IEEE, 15-20 Apr. 2018, pp. 4774–4778, doi: 10.1109/ICASSP.2018.8462105.
  • [59] D.S. Park, W. Chan, Y. Zhang, C.C. Chiu, B. Zoph, E.D. Cubuk, and Q.V. Le, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in INTERSPEECH 2019. Graz, Austria: ISCA Speech, 15-19 Sep. 2019, pp. 2613–2617, doi: 10.21437/Interspeech.2019-2680.
  • [60] N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014. [Online]. Available: https://jmlr.org/papers/v15/srivastava14a.html
  • [61] J. Xu, H. Kim, T. Rainforth, and Y. Teh, “Group equivariant subsampling,” in Advances in Neural Information Processing Systems (NeurIPS 2021), vol. 34. Curran Associates, Inc., 2021, pp. 5934–5946. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2021/file/2ea6241cf767c279cf1e80a790df1885-Paper.pdf
  • [62] M. Ogrodniczuk, “Polish parliamentary corpus,” in Proc. LREC 2018 Workshop “ParlaCLARIN: Creating and Using Parliamentary Corpora,” D. Fišer, M. Eskevich, and F. de Jong, Eds., Miyazaki, Japan, 7 May 2018, pp. 15–19. [Online]. Available: http://lrec-conf.org/workshops/lrec2018/W2/pdf/book_of_proceedings.pdf
  • [63] J.C. Vásquez-Correa and A. Álvarez Muniain, “Novel Speech Recognition Systems Applied to Forensics within Child Exploitation: Wav2vec2.0 vs. Whisper,” Sensors, vol. 23, no. 4, p. 1843, Feb. 2023, doi: 10.3390/s23041843.
  • [64] J. Iranzo-Sánchez et al., “Europarl-st: A multilingual corpus for speech translation of parliamentary debates,” in 2020 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, 4-8 May 2020, pp. 8229–8233, doi: 10.1109/ICASSP40776.2020.9054626.
  • [65] C. Wang et al., “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation,” arXiv (Cornell University), Jan. 2021, doi: 10.48550/arXiv.2101.00390.
  • [66] “STT Pl Quartznet15x5: Speech To Text (STT) model based on QuartzNet for recognizing Polish speech,” (2023). [Online]. Available: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_pl_quartznet15x5 (Accessed 2023-07-31).
  • [67] “NVIDIA FastConformer-Hybrid Large (pl): Polish Fast-Conformer Hybrid (Transducer and CTC) Large model with Punctuation and Capitalization,” (2023). [Online]. Available: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_pl_fastconformer_hybrid_large_pc (Accessed 2023-09-12).
  • [68] P. Guo et al., “Recent developments on ESPnet toolkit boosted by Conformer,” in 2021 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE, 6-11 Jun. 2021, pp. 5874–5878, doi: 10.1109/ICASSP39728.2021.9414858.
  • [69] “Whisper: Github repository,” (2023). [Online]. Available: https://github.com/openai/whisper (Accessed 2023-06-25).
  • [70] A. Babu et al., “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in INTERSPEECH 2022. Incheon, Korea: ISCA Speech, 18-22 Sep. 2022, pp. 2278–2282, doi: 10.21437/Interspeech.2022-14. [Online]. Available: https://www.isca-speech.org/archive/pdfs/interspeech_2022/babu22_interspeech.pdf
  • [71] N.Q. Pham, A. Waibel, and J. Niehues, “Adaptive multilingual speech recognition with pretrained models,” in INTERSPEECH 2022. Incheon, Korea: ISCA Speech, 18-22 Sep. 2022, pp. 3879–3883, doi: 10.21437/Interspeech.2022-872.
  • [72] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in INTERSPEECH 2021. Brno, Czechia: ISCA Speech, 30 Aug.–3 Sep. 2021, pp. 2426–2430, doi: 10.21437/Interspeech.2021-329.
  • [73] A. Radford, J.W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. of the 40th Int. Conf. on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. Honolulu, HI, United States: PMLR, 23–29 Jul 2023, pp. 28 492–28 518. [Online]. Available: https://proceedings.mlr.press/v202/radford23a.html
  • [74] S. Kriman et al., “Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions,” in 2020 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, 4-8 May 2020, pp. 6124–6128. [Online]. Available: https://ieeexplore.ieee.org/xpl/conhome/9040208/proceeding?isnumber=9052899&sortType=vol-only-seq&searchWithin=Quartznet
  • [75] “NVIDIA NeMo: conversational AI toolkit,” (2023). [Online]. Available: https://github.com/NVIDIA/NeMo (Accessed 2023-06-23).
  • [76] J. Huang et al., “Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition,” in 2021 IEEE Int. Conf. on Multimedia and Expo (ICME). Shenzhen, China: IEEE, 05-09 Jul. 2021, pp. 1–6, doi: 10.1109/ICME51207.2021.9428334.
  • [77] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. of 32nd Int. Conf. on Machine Learning (ICML 2015), F. Bach and D. Blei, Eds., vol. 37. Lille, France: PMLR, 6–11 Jul. 2015, pp. 448–456. [Online]. Available: https://proceedings.mlr.press/v37/
  • [78] A.F. Agarap, “Deep learning using rectified linear units (ReLU),” arXiv (Cornell University), Jan. 2018, doi: 10.48550/arXiv.1803.08375.
  • [79] D. Rekesh et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” arXiv (Cornell University), May 2023, doi: 10.48550/arXiv.2305.05084.
  • [80] “NVIDIA Hybrid-Transducer-CTC models,” (2023). [Online]. Available: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc (Accessed 2023-07-25).
  • [81] “Google SentencePiece Unigram: Github repository,” (2023). [Online]. Available: https://github.com/google/sentencepiece (Accessed 2024-01-23).
  • [82] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems 33: Annu. Conf. on Neural Information Processing Systems (NeurIPS 2020), H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds. Curran Associates, Inc., 6-12 Dec. 2020, pp. 12 449–12 460. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html
  • [83] P. Roach et al., “Babel: An eastern european multi-language database,” in Proc. of 4th Int. Conf. on Spoken Language Processing (ICSLP 96), vol. 3, Philadelphia, PA, USA, 3-6 Oct. 1996, pp. 1892–1893, doi: 10.21437/ICSLP.1996. [Online]. Available: https://www.isca-speech.org/archive/icslp_1996/roach96_icslp.html
  • [84] S. Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in INTERSPEECH 2018. Hyderabad, India: ISCA Speech, 2-6 Sep. 2018, pp. 2207–2211, doi: 10.21437/Interspeech.2018-1456.
  • [85] “ESPnet Model Zoo: Github repository,” (2021). [Online]. Available: https://github.com/espnet/espnet_model_zoo (Accessed 2024-03-13).
  • [86] D. Povey et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU. IEEE, Hawaii, USA, 2011. [Online]. Available: https://infoscience.epfl.ch/record/192584
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-a5968f57-fef2-483a-bfc5-5a20cfeb0a26