Article title
Authors
Identifiers
Title variants
Conference
Federated Conference on Computer Science and Information Systems (19th; 8-11 September 2024; Belgrade, Serbia)
Publication languages
Abstracts
In this paper, we study the impact of augmenting spoken language corpora with domain-specific synthetic samples for the purpose of training a speech recognition system. Using both a conventional neural text-to-speech system and a zero-shot one with voice-cloning ability, we generate speech corpora that vary in the number of voices. We compare speech recognition models trained with the addition of different amounts of synthetic data generated using these two methods against a baseline model trained solely on voice recordings. We show that while the quality of the voice-cloned dataset is lower, its greater voice variety makes it much more effective than the dataset containing only a few voices synthesized with a conventional neural text-to-speech system. Furthermore, our experiments indicate that using low-variability synthetic speech quickly leads to saturation in ASR quality, whereas high-variability speech provides improvement even when the total amount of training data is increased by 30%.
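To make the augmentation setup described in the abstract concrete, the sketch below shows one possible way of mixing real recordings with TTS-generated utterances before ASR training. It is a minimal illustration only: the JSON Lines manifest format, the file names, and the `build_augmented_manifest` helper are assumptions made for this example, not the authors' code or data format.

```python
import json
import random
from pathlib import Path


def build_augmented_manifest(real_manifest: Path,
                             synthetic_manifest: Path,
                             output_manifest: Path,
                             synthetic_fraction: float = 0.3,
                             seed: int = 0) -> None:
    """Mix real and synthetic utterances into a single ASR training manifest.

    Both input manifests are assumed to be JSON Lines files with at least
    {"audio_path": ..., "text": ...} per utterance; this format is an
    assumption made for the sketch, not one prescribed by the paper.
    """
    real = [json.loads(line)
            for line in real_manifest.read_text(encoding="utf-8").splitlines()
            if line.strip()]
    synth = [json.loads(line)
             for line in synthetic_manifest.read_text(encoding="utf-8").splitlines()
             if line.strip()]

    # Add synthetic utterances up to the requested fraction of the real corpus,
    # e.g. 0.3 corresponds to enlarging the training set by roughly 30%.
    rng = random.Random(seed)
    rng.shuffle(synth)
    n_extra = min(len(synth), int(len(real) * synthetic_fraction))

    combined = real + synth[:n_extra]
    rng.shuffle(combined)

    with output_manifest.open("w", encoding="utf-8") as f:
        for item in combined:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical file names; any conventional or voice-cloning TTS system
    # could have produced synthetic_train.jsonl.
    build_augmented_manifest(Path("real_train.jsonl"),
                             Path("synthetic_train.jsonl"),
                             Path("augmented_train.jsonl"),
                             synthetic_fraction=0.3)
```

The resulting augmented manifest can then be fed to whichever ASR training pipeline is in use; comparing models trained on manifests built from few-voice versus many-voice synthetic corpora mirrors the experimental comparison reported in the abstract.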
Year
Volume
Pages
579–584
Physical description
Bibliography: 41 items, illustrations, tables, charts
Contributors
author
- Samsung R&D Institute Poland, Plac Europejski 1, 00-844 Warszawa, Poland
author
- Samsung R&D Institute Poland, Plac Europejski 1, 00-844 Warszawa, Poland
author
- Samsung R&D Institute Poland, Plac Europejski 1, 00-844 Warszawa, Poland
author
- Samsung R&D Institute Poland, Plac Europejski 1, 00-844 Warszawa, Poland
author
- Samsung R&D Institute Poland, Plac Europejski 1, 00-844 Warszawa, Poland
author
- Adam Mickiewicz University, Faculty of Mathematics and Computer Science, ul. Uniwersytetu Poznanskiego 4, 61-614 Poznan, Poland
author
- Adam Mickiewicz University, Faculty of Mathematics and Computer Science, ul. Uniwersytetu Poznanskiego 4, 61-614 Poznan, Poland
Bibliography
- 1. A. Fazel, W. Yang, Y. Liu, R. Barra-Chicote, Y. Meng, R. Maas, and J. Droppo, “Synthasr: Unlocking synthetic data for speech recognition,” 2021.
- 2. S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, “Data augmentation for asr using tts via a discrete representation,” in IEEE Automatic Speech Recognition and Understanding Workshop, 2021, pp. 68–75.
- 3. N. Rossenbach, M. Zeineldeen, B. Hilmes, R. Schlüter, and H. Ney, “Comparing the benefit of synthetic training data for various automatic speech recognition architectures,” in IEEE Automatic Speech Recognition and Understanding Workshop, 2021, pp. 788–795.
- 4. X. Tan, T. Qin, F. Soong, and T.-Y. Liu, “A survey on neural speech synthesis,” 2021.
- 5. J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 4779–4783.
- 6. Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: Fast, robust and controllable text to speech,” 2019.
- 7. C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei, “Neural codec language models are zero-shot text to speech synthesizers,” 2023.
- 8. Z. Zhang, L. Zhou, C. Wang et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” 2023.
- 9. K. Shen, Z. Ju, X. Tan, Y. Liu, Y. Leng, L. He, T. Qin, S. Zhao, and J. Bian, “Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers,” 2023.
- 10. M. Bartelds, N. San, B. McDonnell et al., “Making more of little data: Improving low-resource automatic speech recognition using data augmentation,” in Proc. Annual Meeting of the Association for Computational Linguistics, 2023, pp. 715–729.
- 11. N. Rossenbach, A. Zeyer, R. Schlüter, and H. Ney, “Generating synthetic audio data for attention-based speech recognition systems,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 7069–7073.
- 12. X. Zheng, Y. Liu, D. Gunceler, and D. Willett, “Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end asr systems,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 5674–5678.
- 13. M. Kubis, P. Skórzewski, M. Sowański, and T. Ziętkiewicz, “Back Transcription as a Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, December 2023, pp. 11824–11835.
- 14. M. Kubis, P. Skórzewski, M. Sowański, and T. Ziętkiewicz, “Center for Artificial Intelligence Challenge on Conversational AI Correctness,” in Proceedings of the 18th Conference on Computer Science and Intelligence Systems, ser. Annals of Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, and D. Ślęzak, Eds., vol. 35. IEEE, 2023, pp. 1319–1324.
- 15. K. Yang, T.-Y. Hu, J.-H. R. Chang, H. Swetha Koppula, and O. Tuzel, “Text is all you need: Personalizing asr models using controllable speech synthesis,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2023, pp. 1–5.
- 16. T.-Y. Hu, M. Armandpour, A. Shrivastava, J.-H. R. Chang, H. Koppula, and O. Tuzel, “Synt++: Utilizing imperfect synthetic data to improve speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 7682–7686.
- 17. M. Le, A. Vyas, B. Shi et al., “Voicebox: Text-guided multilingual universal speech generation at scale,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 14005–14034.
- 18. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 5206–5210.
- 19. K. Ito and L. Johnson, “The LJ Speech dataset,” 2017.
- 20. E. Bakhturina, V. Lavrukhin, B. Ginsburg, and Y. Zhang, “Hi-Fi Multi-Speaker English TTS Dataset,” in Interspeech, 2021, pp. 2776–2780.
- 21. H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” CoRR, vol. abs/1904.02882, 2019.
- 22. H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An open-source mandarin speech corpus and a speech recognition baseline,” in Oriental COCOSDA 2017, 2017.
- 23. Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A multi-speaker mandarin TTS corpus and the baselines,” CoRR, vol. abs/2010.11567, 2020.
- 24. R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proc. 12th Conference on Language Resources and Evaluation, 2020, pp. 4211–4215.
- 25. A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” 2022.
- 26. E. Bastianelli, A. Vanzo, P. Swietojanski, and V. Rieser, “SLURP: A spoken language understanding resource package,” in Proc. 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 7252–7262.
- 27. C. Veaux, J. Yamagishi, and K. MacDonald, “Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit,” 2019.
- 28. J. Park, S. Jin, J. Park et al., “Conformer-based on-device streaming speech recognition with kd compression and two-pass architecture,” in IEEE Spoken Language Technology Workshop, 2023, pp. 92–99.
- 29. M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99.D, no. 7, pp. 1877–1884, 2016.
- 30. J.-M. Valin and J. Skoglund, “A real-time wideband neural vocoder at 1.6 kb/s using LPCNet,” in Interspeech, 2019.
- 31. Y. Wang, R. J. Skerry-Ryan, D. Stanton et al., “Tacotron: Towards end-to-end speech synthesis,” in Interspeech, 2017.
- 32. N. Ellinas, G. Vamvoukakis, K. Markopoulos et al., “High quality streaming speech synthesis with low, sentence-length-independent latency,” in Interspeech, 2020, pp. 2022–2026.
- 33. A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” 2022.
- 34. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” 2019.
- 35. M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proc. 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880.
- 36. J. Tiedemann and S. Thottingal, “OPUS-MT - Building open translation services for the World,” in Proc. 22nd Annual Conference of the European Association for Machine Translation, 2020.
- 37. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022.
- 38. Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” 2022.
- 39. J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 17022–17033.
- 40. R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019.
- 41. Y. Jia, Y. Zhang, R. Weiss et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in Neural Information Processing Systems, vol. 31, 2018.
Notes
Thematic Sessions: Short Papers
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-1e3650d8-11ab-41b3-b8c8-91a731b9b72a