Article title

A comparative study of deep End-to-End Automatic Speech Recognition models for doctor-patient conversations in Polish in a real-life acoustic environment

Publication languages
EN
Abstracts
EN
The following paper presents research on Automatic Speech Recognition (ASR) methods for building a system that automatically transcribes medical interviews conducted in Polish during clinic visits. The performance of four ASR models based on Deep Neural Networks (DNN) was evaluated: XLSR-53 large, Quartznet15x5, FastConformer Hybrid Transducer-CTC, and Whisper large. The study was conducted on a self-developed speech dataset. Models were evaluated using Word Error Rate (WER), Character Error Rate (CER), Match Error Rate (MER), Word Accuracy (WAcc), Word Information Preserved (WIP), Word Information Lost (WIL), Levenshtein distance, Jaro-Winkler similarity, and the Jaccard index. The results show that the Whisper model outperformed the other tested solutions in the vast majority of the conducted tests: Whisper achieved WER = 20.84%, compared with WER = 67.96% for XLSR-53, WER = 76.25% for Quartznet15x5, and WER = 46.30% for FastConformer. These results indicate that Whisper needs further adaptation for medical conversations, as the current volume of transcription errors is not acceptable in practice (too many mistakes in the description of the patient's health condition).
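For readers who want to reproduce this style of evaluation, the sketch below shows how a Whisper transcript could be scored against a reference transcript with the metrics listed above. It is a minimal illustration, not the authors' pipeline: the openai-whisper, jiwer, and jellyfish packages, the file name visit.wav, and the reference sentence are assumptions, and WAcc is taken here as 1 - WER.

```python
# Minimal metric sketch (assumed tooling: openai-whisper, jiwer, jellyfish).
import whisper
import jiwer
import jellyfish

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard_index(ref: str, hyp: str) -> float:
    """Jaccard index over the sets of words occurring in each transcript."""
    a, b = set(ref.split()), set(hyp.split())
    return len(a & b) / len(a | b) if a | b else 1.0

# Transcribe one recording with Whisper large, forcing Polish decoding.
model = whisper.load_model("large")
hypothesis = model.transcribe("visit.wav", language="pl")["text"].strip().lower()
reference = "pacjent zgłasza ból głowy od trzech dni"  # hypothetical ground truth

print(f"WER  = {jiwer.wer(reference, hypothesis):.4f}")
print(f"CER  = {jiwer.cer(reference, hypothesis):.4f}")
print(f"MER  = {jiwer.mer(reference, hypothesis):.4f}")
print(f"WIP  = {jiwer.wip(reference, hypothesis):.4f}")
print(f"WIL  = {jiwer.wil(reference, hypothesis):.4f}")
print(f"WAcc = {1 - jiwer.wer(reference, hypothesis):.4f}")  # assuming WAcc = 1 - WER
print(f"Levenshtein  = {levenshtein(reference, hypothesis)}")
print(f"Jaro-Winkler = {jellyfish.jaro_winkler_similarity(reference, hypothesis):.4f}")
print(f"Jaccard      = {jaccard_index(reference, hypothesis):.4f}")
```

jiwer derives WER, MER, WIP, and WIL from the same word-level alignment, which keeps those measures mutually consistent; the Levenshtein distance and Jaccard index are simple enough to compute directly.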
Authors
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
  • Central Institute for Labour Protection-National Research Institute, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
  • Central Institute for Labour Protection-National Research Institute, Poland
  • Central Institute for Labour Protection-National Research Institute, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
  • Faculty of Applied Informatics and Mathematics, Warsaw University of Life Sciences, Poland
  • JAS Technologie Sp. z o.o., Poland
  • JAS Technologie Sp. z o.o., Poland
Bibliography
  • [1] J. Li et al., “Recent advances in end-to-end automatic speech recognition,” arXiv (Cornell University), 2021. [Online]. Available: https://doi.org/10.48550/arXiv.2111.01690
  • [2] K. Kuligowska, M. Stanusch, and M. Koniew, “Challenges of automatic speech recognition for medical interviews - research for Polish language,” Procedia Computer Science, vol. 225, pp. 1134-1141, 2023.
  • [3] A. Czyżewski, “Optimizing medical personnel speech recognition models using speech synthesis and reinforcement learning,” Journal of the Acoustical Society of America, vol. 154, pp. A202-A203, 2023.
  • M. Zielonka, W. Krasiński, J. Nowak, P. Rośleń, J. Stopiński, M. Żak, F. Górski, and A. Czyżewski, “A survey of automatic speech recognition deep models performance for Polish medical terms,” in 2023 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA). Poznan, Poland: IEEE, 2023, pp. 19-24.
  • S. Zaporowski, “The impact of foreign accents on the performance of Whisper family models using medical speech in Polish,” in Harnessing Opportunities: Reshaping ISD in the post-COVID-19 and Generative AI Era (ISD2024 Proceedings), B. Marcinkowski, A. Przybylek, A. Jarzebowicz, N. Iivari, E. Insfran, M. Lang, H. Linger, and C. Schneider, Eds. Gdańsk, Poland: University of Gdańsk, 2024.
  • B. Hnatkowska and J. Sas, “Application of automatic speech recognition to medical reports spoken in Polish,” Journal of Medical Informatics Technologies, vol. 12, 2008.
  • T.-B. Nguyen and A. Waibel, “ConVoiFilter: A case study of doing cocktail party speech recognition,” arXiv (Cornell University), Aug. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2308.11380
  • [4] J. Grosman, “Fine-tuned XLSR-53 large model for speech recognition in Polish,” https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-polish, 2021, online; accessed 20 May 2024.
  • [5] NVIDIA NeMo Toolkit, “STT Pl Quartznet15x5,” https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_pl_quartznet15x5, 2023, online; accessed 20 May 2024.
  • [6] NVIDIA, “STT Pl FastConformer Hybrid Transducer-CTC Large P&C,” https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_pl_fastconformer_hybrid_large_pc, 2023, online; accessed 20 May 2024.
  • [7] OpenAI, “Whisper - general-purpose speech recognition model,” https://github.com/openai/whisper, 2023, online; accessed 20 May 2024.
  • [8] E. Rudnicka, M. Maziarz, M. Piasecki, and S. Szpakowicz, “A strategy of mapping Polish WordNet onto Princeton WordNet,” in Proc. 24th COLING 2012: Posters. Mumbai, India: COLING 2012, 8-15 Dec. 2012, pp. 1039-1048. [Online]. Available: https://aclanthology.org/C12-2101/
  • [9] A. Pohl and B. Ziółko, “Using part of speech n-grams for improving automatic speech recognition of Polish,” in Machine Learning and Data Mining in Pattern Recognition. Proc. of 9th Int. Conf., MLDM 2013. New York, NY, USA: Springer Berlin, Heidelberg, 19-25 Jul. 2013, pp. 492-504. [Online]. Available: https://doi.org/10.1007/978-3-642-39712-7_38
  • [10] G. Rehm and H. Uszkoreit, The Polish language in the digital age. Berlin, Heidelberg: Springer, Jan. 2012. [Online]. Available: https://doi.org/10.1007/978-3-642-30811-6
  • [11] A. Janicki and D. Wawer, “Automatic speech recognition for Polish in a computer game interface,” in 2011 Federated Conf. on Computer Science and Information Systems (FedCSIS). Szczecin, Poland: IEEE, 18-21 Sep. 2011, pp. 711-716. [Online]. Available: https://ieeexplore.ieee.org/document/6078265
  • [12] K. Marasek, Ł. Brocki, D. Korzinek, K. Wołk, and R. Gubrynowicz, “Spoken language translation for Polish,” arXiv (Cornell University), Nov. 2015. [Online]. Available: https://doi.org/10.48550/arXiv.1511.07788
  • [13] L. Pawlaczyk and P. Bosky, “Skrybot - a system for automatic speech recognition of Polish language,” in Man-Machine Interactions. Berlin, Heidelberg: Springer, Jan. 2009, vol. 59, pp. 381-387. [Online]. Available: https://doi.org/10.1007/978-3-642-00563-3_40
  • [14] J. Nouza, P. Cerva, and R. Safarik, “Cross-lingual adaptation of broadcast transcription system to Polish language using public data sources,” in Human Language Technology. Challenges for Computer Science and Linguistics. 7th Language and Technology Conf., LTC 2015. Poznań, Poland: Springer Cham, 27-29 Nov. 2015, pp. 31-41. [Online]. Available: https://doi.org/10.1007/978-3-319-93782-3_3
  • [15] B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, and J. Galka, “Application of HTK to the Polish language,” in ICALIP 2008 Int. Conf. on Audio, Language and Image Processing. Shanghai, China: IEEE, 7-9 Jul. 2008, pp. 31-41. [Online]. Available: https://doi.org/10.1109/ICALIP.2008.4590266
  • [16] S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang, “Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions,” in 2020 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, 4-8 May 2020, pp. 6124-6128. [Online]. Available: https://ieeexplore.ieee.org/xpl/conhome/9040208/proceeding?isnumber=9052899&sortType=vol-only-seq&searchWithin=Quartznet
  • [17] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” in INTERSPEECH 2019. Graz, Austria: ISCA Speech, 15-19 Sep. 2019, pp. 71-75. [Online]. Available: https://www.isca-archive.org/interspeech_2019/li19_interspeech.html
  • [18] “NVIDIA NeMo: conversational AI toolkit,” (2023). [Online]. Available: https://github.com/NVIDIA/NeMo
  • [19] “Common Voice open source, multi-language dataset of voices,” (2023). [Online]. Available: https://commonvoice.mozilla.org/en/datasets
  • [20] J. Huang, O. Kuchaiev, P. O’Neill, V. Lavrukhin, J. Li, A. Flores, G. Kucsko, and B. Ginsburg, “Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition,” in 2021 IEEE Int. Conf. on Multimedia and Expo (ICME). Shenzhen, China: IEEE, 5-9 Jul. 2021, pp. 1-6. [Online]. Available: https://doi.org/10.1109/ICME51207.2021.9428334
  • [21] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A large-scale multilingual dataset for speech research,” in INTERSPEECH 2020. Shanghai, China: ISCA Speech, 25-29 Oct. 2020, pp. 2757-2761. [Online]. Available: https://doi.org/10.21437/Interspeech.2020-2826
  • [22] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation,” arXiv (Cornell University), Jan. 2021. [Online]. Available: https://doi.org/10.48550/arXiv.2101.00390
  • [23] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” in INTERSPEECH 2020. Shanghai, China: ISCA Speech, 25-29 Oct. 2020, pp. 5036-5040. [Online]. Available: https://doi.org/10.21437/Interspeech.2020-3015
  • [24] D. Rekesh, S. Kriman, S. Majumdar, V. Noroozi, H. Huang, O. Hrinchuk, A. Kumar, and B. Ginsburg, “Fast conformer with linearly scalable attention for efficient speech recognition,” arXiv (Cornell University), May 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.05084
  • [25] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv (Cornell University), Nov. 2012. [Online]. Available: https://doi.org/10.48550/arXiv.1211.3711
  • [26] T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” arXiv (Cornell University), Aug. 2018. [Online]. Available: https://doi.org/10.48550/arXiv.1808.06226
  • [27] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems 33: Annu. Conf. on Neural Information Processing Systems (NeurIPS 2020), H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds. Curran Associates, Inc., 6-12 Dec. 2020, pp. 12 449-12 460. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html
  • [28] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in INTERSPEECH 2021. Brno, Czechia: ISCA Speech, 30 Aug. - 3 Sep. 2021, pp. 2426-2430. [Online]. Available: https://doi.org/10.21437/Interspeech.2021-329
  • [29] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. of the 23rd Int. Conf. on Machine Learning. Pittsburgh, PA, USA: Association for Computing Machinery, New York, United States, Jun. 2006, pp. 369-376. [Online]. Available: https://doi.org/10.1145/1143844.1143891
  • [30] P. Roach, S. Arnfield, W. Barry, J. Baltova, M. Boldea, A. Fourcin, W. Gonet, R. Gubrynowicz, E. Hallum, L. Lamel et al., “BABEL: An Eastern European multi-language database,” in Proc. of 4th Int. Conf. on Spoken Language Processing (ICSLP 96), vol. 3, Philadelphia, PA, USA, 3-6 Oct. 1996, pp. 1892-1893. [Online]. Available: https://www.isca-archive.org/icslp_1996/roach96_icslp.html
  • [31] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 369-375. [Online]. Available: https://ieeexplore.ieee.org/document/8639038
  • [32] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. of the 40th Int. Conf. on Machine Learning, ser. Proc. of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. Honolulu, HI, United States: PMLR, 23-29 Jul. 2023, pp. 28 492-28 518. [Online]. Available: https://proceedings.mlr.press/v202/radford23a.html
  • [33] A. Ali and S. Renals, “Word error rate estimation for speech recognition: e-WER,” in Proc. of the 56th Annu. Meeting of the Association for Computational Linguistics, vol. 2. ACL, 2018, pp. 20-24. [Online]. Available: https://aclanthology.org/P18-2
  • [34] A. C. Morris, V. Maier, and P. Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Proc. INTERSPEECH 2004. ISCA Speech, 4-8 Oct. 2004, pp. 2765-2768. [Online]. Available: https://www.isca-archive.org/interspeech_2004/morris04_interspeech.html
  • [35] V. I. Levenshtein et al., “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710, 1966.
  • [36] A. S. Lhoussain, G. Hicham, and Y. Abdellah, “Adaptating the Levenshtein distance to contextual spelling correction,” International Journal of Computer Science and Applications, vol. 12, no. 1, pp. 127-133, 2015. [Online]. Available: https://www.researchgate.net/publication/273758433_Adaptating_the_Levenshtein_Distance_to_Contextual_Spelling_Correction
  • [37] M. A. Jaro, “Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida,” Journal of the American Statistical Association, vol. 84, no. 406, pp. 414-420, 1989.
  • [38] W. E. Winkler, “The state of record linkage and current research problems,” Statistical Research Division, US Bureau of the Census, Washington, DC, 1999. [Online]. Available: https://www.researchgate.net/publication/2509449_The_State_of_Record_Linkage_and_Current_Research_Problems
  • [39] O. Rozinek and J. Mares, “Fast and precise convolutional Jaro and Jaro-Winkler similarity,” in 2024 35th Conf. of Open Innovations Association (FRUCT). Tampere, Finland: IEEE, 2024, pp. 604-613.
  • [40] M. Masson and J. Carson-Berndsen, “Investigating phoneme similarity with artificially accented speech,” in Proc. of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, 2023, pp. 49-57. [Online]. Available: https://aclanthology.org/2023.sigmorphon-1
Notes
1. Record created with funds from the Ministry of Science and Higher Education (MNiSW), agreement no. POPUL/SP/0154/2024/02, under the programme "Społeczna odpowiedzialność nauki II" (Social Responsibility of Science II) - module: Popularisation of Science (2025).
2. This work was supported by the National Centre for Research and Development under project INFOSTRATEG-IV/0042/2022.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-b759e935-db25-4091-8597-0c5d1370c3d3