Article title

Performance evaluation of deep neural networks applied to speech recognition: RNN, LSTM and GRU

Full text
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
Deep Neural Networks (DNNs) are neural networks with many hidden layers. DNNs are becoming popular in automatic speech recognition tasks, which combine a good acoustic model with a language model. Standard feedforward neural networks cannot handle speech data well because they have no way to feed information from a later layer back to an earlier layer. Recurrent Neural Networks (RNNs) were therefore introduced to take temporal dependencies into account. However, the shortcoming of RNNs is that they cannot handle long-term dependencies due to the vanishing/exploding gradient problem. Long Short-Term Memory (LSTM) networks, a special case of RNNs, were therefore introduced to capture long-term dependencies in speech in addition to short-term dependencies. Similarly, Gated Recurrent Unit (GRU) networks are an improvement on LSTM networks that also take long-term dependencies into consideration. In this paper, we evaluate RNN, LSTM, and GRU networks and compare their performance on a reduced TED-LIUM speech data set. The results show that LSTM achieves the best word error rates; however, GRU optimization is faster while achieving word error rates close to those of LSTM.
Year
Pages
235-245
Physical description
Bibliography: 41 items, figures
Authors
Bibliography
  • [1] G. E. Hinton, S. Osindero, Y. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18, 1527-1554, 2006.
  • [2] A. Rousseau, P. Deléglise, Y. Estève, Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks. Proceedings of the Seventh Language Resources and Evaluation Conference, 3935-3939, May 2014.
  • [3] Y. Gaur, F. Metze, J. P. Bigham, Manipulating Word Lattices to Incorporate Human Corrections, Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 2016.
  • [4] E. Busseti, I. Osband, S. Wong, Deep Learning for Time Series Modeling, Seminar on Collaborative Intelligence in the TU Kaiserslautern, Germany, June 2012.
  • [5] Deep Learning for Sequential Data - Part V: Handling Long Term Temporal Dependencies, https://prateekvjoshi.com/2016/05/31/deep-learning-for-sequential-data-part-v-handling-long-term-temporal-dependencies/, last retrieved July 2017.
  • [6] Understanding LSTM Networks, http://colah.github.io/posts/2015-08-Understanding-LSTMs/, last retrieved July 2017.
  • [7] A. Graves, A. R. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6645-6649, 2013.
  • [8] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning, 369-376, ACM, June 2006.
  • [9] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • [10] TED-LIUM Corpus, http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus, last retrieved July 2017.
  • [11] C. C. Chiu, D. Lawson, Y. Luo, G. Tucker, K. Swersky, I. Sutskever, N. Jaitly, An online sequence-to-sequence model for noisy speech recognition, arXiv preprint arXiv:1706.06428, 2017.
  • [12] T. Hori, S. Watanabe, Y. Zhang, W. Chan, Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM, arXiv preprint arXiv:1706.02737, 2017.
  • [13] W. Chan, N. Jaitly, Q. V. Le, O. Vinyals, Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.
  • [14] T. Mikolov, Statistical language models based on neural networks, PhD thesis, Brno University of Technology, 2012.
  • [15] W. Zaremba, I. Sutskever, O. Vinyals, Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
  • [16] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 3104-3112, 2014.
  • [17] F. A. Gers, E. Schmidhuber, LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6), 1333-1340, 2001.
  • [18] O. Vinyals, S. V. Ravuri, D. Povey, Revisiting recurrent neural networks for robust ASR. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4085-4088, 2012.
  • [19] A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, A. Y. Ng, Recurrent neural networks for noise reduction in robust ASR. Thirteenth Annual Conference of the International Speech Communication Association, 2012.
  • [20] H. Sak, A. Senior, F. Beaufays, Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128, 2014.
  • [21] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5), 602-610, 2005.
  • [22] A. Graves, S. Fernández, J. Schmidhuber, Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. In: Duch W., Kacprzyk J., Oja E., Zadrożny S. (eds) Artificial Neural Networks: Formal Models and Their Applications – ICANN, Lecture Notes in Computer Science, vol. 3697, Springer, Berlin, Heidelberg, 2005.
  • [23] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber, A novel connectionist system for unconstrained handwriting recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 855-868, 2009.
  • [24] A. Graves, N. Jaitly, Towards end-to-end speech recognition with recurrent neural networks. Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1764-1772, 2014.
  • [25] A. Graves, N. Jaitly, A. R. Mohamed, Hybrid speech recognition with deep bidirectional LSTM. 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 273-278, December 2013.
  • [26] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Y. Ng, Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
  • [27] H. Xu, G. Chen, D. Povey, S. Khudanpur, Modeling phonetic context with non-random forests for speech recognition. Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [28] T. Ko, V. Peddinti, D. Povey, S. Khudanpur, Audio augmentation for speech recognition. Sixteenth Annual Conference of the International Speech Communication Association, 3586-3589, 2015.
  • [29] G. Chen, H. Xu, M. Wu, D. Povey, S. Khudanpur, Pronunciation and silence probability modeling for ASR. Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [30] Y. Gaur, F. Metze, J. P. Bigham, Manipulating Word Lattices to Incorporate Human Corrections. Seventeenth Annual Conference of the International Speech Communication Association, 3062-3065, 2016.
  • [31] K. Cho, B. van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.
  • [32] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, Y. Bengio, End-to-end attention-based large vocabulary speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4945-4949, March 2016.
  • [33] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory. Neural Computation, 9(8), 1735-1780, November 1997.
  • [34] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, Technical report, arXiv preprint arXiv:1409.0473, 2014.
  • [35] D. Kingma, J. Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [36] Reduced TED-LIUM release 2 corpus (11.7 GB), http://www.cs.ndsu.nodak.edu/~siludwig/data/TEDLIUM_release2.zip, last retrieved July 2017.
  • [37] Speech recognition performance, https://en.wikipedia.org/wiki/Speech_recognition#Performance, last retrieved July 2017.
  • [38] Levenshtein distance, https://en.wikipedia.org/wiki/Levenshtein_distance, last retrieved July 2017.
  • [39] A. C. Morris, V. Maier, P. Green, From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. Eighth International Conference on Spoken Language Processing, 2004.
  • [40] Word error rate, https://en.wikipedia.org/wiki/Word_error_rate, last retrieved July 2017.
  • [41] A. Marzal, E. Vidal, Computation of normalized edit distance and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9), 926-932, 1993.
Notes
Record compiled under agreement 509/P-DUN/2018 from funds of the Polish Ministry of Science and Higher Education (MNiSW) allocated to science-dissemination activities (2019).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-0106e25d-92b6-4c93-8317-367a9f574578