Article title

Cloning the voice and speech of Piotr Fronczewski for Polish speech synthesis

Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
The quality of synthetically generated speech has improved significantly in recent years, largely owing to the technological development of speech synthesis systems, in particular those based on deep neural networks (DNNs). However, conveying emotion in synthetic speech remains a challenge: most existing systems do not reproduce the emotional context that pervades human-to-human interaction, and this lack of expressiveness limits their emotional intelligence. This work aimed to develop a recording method for preparing a balanced corpus of emotional recordings in Polish for use in speech synthesis based on artificial intelligence (AI) algorithms. An essential aspect of the work was the selection of a voice talent whose recordings would capture the full spectrum of an actor's voice, emphasizing the actor's interpretation and the emotions conveyed by the content. The distinguished actor Piotr Fronczewski was chosen for this role.
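A "balanced corpus" of the kind mentioned in the abstract is conventionally assembled by greedy sentence selection that maximizes phonetic coverage (cf. refs. 21-23 in the bibliography below). The following Python sketch illustrates only that general idea, not the authors' actual procedure; phonemes_of stands in for a hypothetical Polish grapheme-to-phoneme converter.

```python
from collections import Counter

def greedy_select(sentences, phonemes_of, target_size):
    """Greedily pick sentences that add the most uncovered phonemes."""
    covered = Counter()          # phoneme -> occurrence count so far
    selected = []
    pool = list(sentences)
    while pool and len(selected) < target_size:
        # Gain = number of phoneme types the sentence adds that are
        # not yet covered by the selection.
        def gain(s):
            return sum(1 for p in set(phonemes_of(s)) if covered[p] == 0)
        best = max(pool, key=gain)
        if gain(best) == 0:
            break                # coverage can no longer improve
        pool.remove(best)
        selected.append(best)
        covered.update(phonemes_of(best))
    return selected

# Toy usage with a naive letter-level stand-in for a real Polish
# grapheme-to-phoneme converter:
if __name__ == "__main__":
    corpus = ["ala ma kota", "żółw gra w szachy", "kot pije mleko"]
    print(greedy_select(corpus, lambda s: list(s.replace(" ", "")), 2))
```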
Year
Pages
art. no. 2024112
Physical description
Bibliography: 52 items; colour illustrations, photographs, charts
Authors
  • Multimedia Department, Polish-Japanese Academy of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland
Bibliography
  • 1. J. Shen, R. Pang, R.J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R.J. Skerry-Ryan, R.A. Saurous, Y. Agiomyrgiannakis, Y. Wu; Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions; https://arxiv.org/abs/1712.05884
  • 2. N. Kaur, P. Singh; Conventional and contemporary approaches used in text to speech synthesis: A review; Artificial Intelligence Review, 2023, 56, 5837-5880
  • 3. Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, T.Y. Liu; FastSpeech 2: Fast and High-Quality End-to-End Text to Speech; arXiv preprint, 2020; https://arxiv.org/abs/2006.04558
  • 4. W. Hu, X. Zhu; A real-time voice cloning system with multiple algorithms for speech quality improvement; PLoS ONE, 2023, 18(4), e0283440
  • 5. Y. Lei, S. Yang, X. Wang, L. Xie; MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis; IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30, 853-864
  • 6. CereProc Company Website; http://www.cereproc.com/ (accessed on 2023.07.17)
  • 7. Train Your Voice Model; https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-voice-create-voice (accessed on 2023.07.17)
  • 8. Narakeet Company Website; https://www.narakeet.com/ (accessed on 2023.07.28)
  • 9. Speechify Company Website; https://speechify.com/text-to-speech-online/ (accessed on 2023.07.28)
  • 10. K. Szklanny, J. Lachowicz; Implementing a Statistical Parametric Speech Synthesis System for a Patient with Laryngeal Cancer; Sensors, 2022, 22(9), 3188
  • 11. Resemble Company Website; https://www.resemble.ai/cloned/ (accessed on 2023.07.17)
  • 12. Elevenlabs Company Website; https://beta.elevenlabs.io/voice-lab (accessed on 2023.07.17)
  • 13. Beyondwords Company Website; https://beyondwords.io/ai-voice-ethics/ (accessed on 2023.07.17)
  • 14. Synthesia Company Website; https://www.synthesia.io/ (accessed on 2023.07.17)
  • 15. Y.A. Li, C. Han, N. Mesgarani; StyleTTS: A style-based generative model for natural and diverse text-to-speech synthesis; arXiv preprint, 2022; https://arxiv.org/abs/2205.15439
  • 16. X. Cai, D. Dai, Z. Wu, X. Li, J. Li, H. Meng; Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition; In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, 5734-5738
  • 17. K. Szklanny, S. Koszuta; Implementation and verification of speech database for unit selection speech synthesis; In: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), Prague, Czech Republic, 3-6 September 2017
  • 18. K. Szklanny; Optimization of the cost function in corpus-based Polish speech synthesis [in Polish: Optymalizacja funkcji kosztu w korpusowej syntezie mowy polskiej]; PhD thesis, Polsko-Japońska Wyższa Szkoła Technik Komputerowych, Warsaw, Poland, 2009
  • 19. D. Oliver, K. Szklanny; Creation and analysis of a Polish speech database for use in unit selection synthesis; In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, 24-26 May 2006
  • 20. K. Szklanny; Multimodal Speech Synthesis for Polish Language; In: Man-Machine Interactions 3, Advances in Intelligent Systems and Computing; D. Gruca, T. Czachórski, S. Kozielski, Eds.; Springer, 2014, 242, 325-333
  • 21. A.S. Bailador; CorpusCrt; Technical report, Polytechnic University of Catalonia (UPC), 1998
  • 22. B. Bozkurt, O. Ozturk, T. Dutoit; Text design for TTS speech corpus building using a modified greedy selection; In: Proceedings of the Eighth European Conference on Speech Communication and Technology, Geneva, Switzerland, 1-4 September 2003
  • 23. R.A. Clark, K. Richmond, S. King; Multisyn: Open-domain unit selection for the Festival speech synthesis system; Speech Communication, 2007, 49, 317-330
  • 24. P. Boersma; Praat, a system for doing phonetics by computer; Glot International, 2001, 5(9), 341-345
  • 25. D. Kamińska, T. Sapiński; Polish emotional speech recognition based on the committee of classifiers; Przegląd Elektrotechniczny, 2017, 93, 101-105
  • 26. F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, K. Truong; The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing; IEEE Transactions on Affective Computing, 2015, 7(2), 190-202
  • 27. F. Eyben, M. Wöllmer, B. Schuller; openSMILE: The Munich versatile and fast open-source audio feature extractor; In: Proceedings of the 18th ACM International Conference on Multimedia, 2010, 1459-1462
  • 28. K. Klessa, M. Karpiński, A. Wagner; Annotation Pro - a new software tool for annotation of linguistic and paralinguistic features; In: Proceedings of the Tools and Resources for the Analysis of Speech Prosody (TRASP) Workshop; D. Hirst, B. Bigi, Eds.; Aix en Provence, 2013, 51-54
  • 29. F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss; A Database of German Emotional Speech; In: Proc. of Interspeech 2005, Lisbon, Portugal, 2005
  • 30. E. Douglas-Cowie, N. Campbell, R. Cowie, P. Roach; Emotional speech: Towards a new generation of databases; Speech Communication, 2003, 40, 33-60
  • 31. S.T. Jovičić, Z. Kašić, M. Đorđević, M. Rajković; Serbian emotional speech database: design, processing and evaluation; In: Proc. SPECOM 2004, St. Petersburg, Russia, 2004
  • 32. D. Ververidis, C. Kotropoulos; A State of the Art on Emotional Speech Databases; In: Proc. of 1st Richmedia Conf., Lausanne, Switzerland, October 2003, 109-119
  • 33. P. Staroniewicz; Polish emotional speech database - design; In: Proc. of 55th Open Seminar on Acoustics, Wrocław, Poland, 2008, 373-378
  • 34. P. Staroniewicz, W. Majewski; Polish emotional speech database - recording and preliminary validation; In: Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions: COST Action 2102 International Conference Prague, Czech Republic, 15-18 October 2008; Revised Selected and Invited Papers; Springer, 2009, 42-49
  • 35. R.A. Khalil, E. Jones, M.I. Babar, T. Jan, M.H. Zafar, T. Alhussain; Speech emotion recognition using deep learning techniques: A review; IEEE Access, 2019, 7, 117327-117345
  • 36. R. Plutchik, H. Kellerman; Emotion: Theory, Research, and Experience; Academic Press, 1980
  • 37. K. Zhou, B. Sisman, R. Rana, B.W. Schuller, H. Li; Speech synthesis with mixed emotions; IEEE Transactions on Affective Computing, 2022, 14(4), 3120-3134
  • 38. K. Szklanny, R. Gubrynowicz, K. Iwanicka-Pronicka, A. Tylki-Szymańska; Analysis of voice quality in patients with late-onset Pompe disease; Orphanet Journal of Rare Diseases, 2016, 11(1), 1-9
  • 39. K. Szklanny, A. Tylki-Szymańska; Follow-up analysis of voice quality in patients with late-onset Pompe disease; Orphanet Journal of Rare Diseases, 2018, 13(1), 1-7
  • 40. K. Szklanny, R. Gubrynowicz, A. Tylki-Szymańska; Voice alterations in patients with Morquio A syndrome; Journal of Applied Genetics, 2018, 59, 73-80
  • 41. K. Szklanny, P. Wrzeciono; The application of a genetic algorithm in the noninvasive assessment of vocal nodules in children; IEEE Access, 2019, 7, 44966-44976
  • 42. K. Szklanny; Acoustic Parameters in the Evaluation of Voice Quality of Choral Singers. Prototype of Mobile Application for Voice Quality Evaluation; Archives of Acoustics, 2019, 44(3), 439-446
  • 43. M.P. Fabre; Un procédé électrique percutané d'inscription de l'accolement glottique au cours de la phonation: glottographie de haute fréquence. Premiers résultats; Bull Acad Nat Med, 1957, 141, 66-69
  • 44. R.J. Baken; Clinical measurement of speech and voice; College-Hill Press, 1987
  • 45. L. Cveticanin; Review on mathematical and mechanical models of the vocal cord; Journal of Applied Mathematics, 2012, 928591
  • 46. D. Kamińska, T. Sapiński, A. Pelikant; Polish Emotional Natural Speech Database; In: Proceedings of the Signal Processing Symposium, 2015
  • 47. M. Igras, B. Ziółko; A database of emotional speech recordings [in Polish: Baza danych nagrań mowy emocjonalnej]; Studia Informatica, 2013, 34, 67-77
  • 48. T. Sapiński, D. Kamińska, A. Pelikant, C. Ozcinar, E. Avots, G. Anbarjafari; Multimodal database of emotional speech, video and gestures; In: Pattern Recognition and Information Forensics: ICPR 2018 International Workshops, CVAUI, IWCF, and MIPPSNA, Beijing, China, 20-24 August 2018; Revised Selected Papers 24, Springer International Publishing, 2019, 153-163
  • 49. Z. Piątek, M. Kłaczyński; Acoustic Methods in Identifying Symptoms of Emotional States; Archives of Acoustics, 2021, 46(2), 259-269
  • 50. A. Janicki, M. Turkot; Recognition of the speaker's emotional state using support vector machines (SVM) [in Polish: Rozpoznawanie stanu emocjonalnego mówcy z wykorzystaniem maszyny wektorów wspierających (SVM)]; Krajowe Sympozjum Telekomunikacji i Teleinformatyki, Bydgoszcz, 2008
  • 51. J. Cichosz; The use of selected speech signal features to recognize and model emotions for the Polish language [in Polish: Wykorzystanie wybranych cech sygnału mowy do rozpoznawania i modelowania emocji dla języka polskiego]; PhD thesis, Lodz University of Technology, Łódź, 2008
  • 52. G. Demenko, M. Jastrzębska; Analysis of vocal stress in emergency phone calls [in Polish: Analiza stresu głosowego w rozmowach z telefonu alarmowego]; XVIII Conference on Acoustic and Biomedical Engineering, Zakopane, Poland, 2011
Notes
Record developed with funds of the Polish Ministry of Science and Higher Education (MNiSW), agreement no. POPUL/SP/0154/2024/02, under the programme "Social Responsibility of Science II", module: Science Popularisation (2025).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-10c76c54-cbc0-40f9-8c1b-aaf28c1688a2