Identifiers
Title variants
Publication languages
Abstracts
Deep neural networks (DNNs) currently play a vital role in automatic speech recognition (ASR). The convolutional neural network (CNN) and the recurrent neural network (RNN) are advanced variants of the DNN: the former is well suited to capturing the spatial properties of a speech signal and the latter its temporal properties, and both strongly influence recognition accuracy. Working on the raw speech signal, a CNN has shown superiority over precomputed acoustic features. Recently, a novel first convolution layer named SincNet was proposed to increase interpretability and system performance. In this work, we propose to combine a SincNet-CNN with a light gated recurrent unit (LiGRU) in order to reduce the computational load and increase interpretability while maintaining high accuracy. Different configurations of the hybrid model are extensively examined to achieve this goal. All of the experiments were conducted with the Kaldi and PyTorch-Kaldi toolkits on a Hindi speech dataset. The proposed model achieves a word error rate (WER) of 8.0%.
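To make the two building blocks named in the abstract concrete, here is a minimal PyTorch sketch (PyTorch-Kaldi, used in the experiments, is PyTorch-based) of a SincNet-style first convolution layer, whose filters are band-passes defined by two learnable cut-off frequencies each, and of a LiGRU cell, which drops the GRU reset gate and uses ReLU activations with batch normalization. All class names, layer sizes, and initialization values below are illustrative assumptions rather than the authors' configuration; reference implementations ship with the PyTorch-Kaldi toolkit [42].

```python
import torch
import torch.nn as nn

class SincConv(nn.Module):
    """Sketch of a SincNet-style first layer: each output channel is an
    ideal band-pass built from two learnable cut-offs and windowed with
    a Hamming window (all hyperparameters here are assumptions)."""
    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        # Learnable lower cut-off and bandwidth per filter (Hz); uniform
        # init for brevity (SincNet itself initializes on the mel scale).
        self.low_hz = nn.Parameter(torch.linspace(30.0, sample_rate / 2 - 300.0, out_channels))
        self.band_hz = nn.Parameter(torch.full((out_channels,), 100.0))
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1, dtype=torch.float32)
        self.register_buffer("t", n / sample_rate)             # time axis in seconds
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                                      # x: (batch, 1, samples)
        f1 = torch.abs(self.low_hz)
        f2 = f1 + torch.abs(self.band_hz)
        # A band-pass is the difference of two ideal low-pass responses.
        low = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * self.t)
        high = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * self.t)
        filters = ((high - low) * self.window).unsqueeze(1)    # (out, 1, kernel)
        return nn.functional.conv1d(x, filters)

class LiGRUCell(nn.Module):
    """Light gated recurrent unit: a GRU without the reset gate, with a
    ReLU candidate activation and batch-normalized input projections."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.wz = nn.Linear(input_size, hidden_size, bias=False)   # gate input proj.
        self.wh = nn.Linear(input_size, hidden_size, bias=False)   # candidate input proj.
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False)  # gate recurrence
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False)  # candidate recurrence
        self.bn_z = nn.BatchNorm1d(hidden_size)
        self.bn_h = nn.BatchNorm1d(hidden_size)

    def forward(self, x, h):                                   # x: (batch, in), h: (batch, hidden)
        z = torch.sigmoid(self.bn_z(self.wz(x)) + self.uz(h))  # update gate only
        h_cand = torch.relu(self.bn_h(self.wh(x)) + self.uh(h))
        return z * h + (1.0 - z) * h_cand                      # interpolated new state

# Smoke test on a batch of one-second raw waveforms (assumed 16 kHz).
x = torch.randn(4, 1, 16000)
feats = SincConv()(x)                       # (4, 80, 15750)
cell = LiGRUCell(input_size=80, hidden_size=256)
h = torch.zeros(4, 256)
h = cell(feats[:, :, 0], h)                 # one recurrent time step
```

In a hybrid of the kind described above, the SincConv output (after further CNN layers and pooling) would be fed frame by frame through stacked LiGRU layers before the output classifier.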
Keywords
Publisher
Journal
Year
Volume
Pages
397–417
Physical description
Bibliography: 57 items, figures, tables
Authors
author
- National Institute of Technology, Department of Computer Engineering, Kurukshetra, Haryana, India
author
- Aggarwal, National Institute of Technology, Department of Computer Engineering, Kurukshetra, Haryana, India
Bibliography
- [1] Abdel-Hamid O., Mohamed A.-r., Jiang H., Deng L., Penn G., Yu D.: Convolutional Neural Networks for Speech Recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22(10), pp. 1533–1545, 2014.
- [2] Abdel-Hamid O., Mohamed A.-r., Jiang H., Penn G.: Applying Convolutional Neural Networks Concepts to Hybrid NN-HMM Model for Speech Recognition. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277–4280, IEEE, 2012.
- [3] Ba J.L., Kiros J.R., Hinton G.E.: Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [4] Bengio Y., Simard P., Frasconi P.: Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol. 5(2), pp. 157–166, 1994.
- [5] Biswas A., Sahu P.K., Chandra M.: Admissible wavelet packet features based on human inner ear frequency response for Hindi consonant recognition. Computers & Electrical Engineering, vol. 40(4), pp. 1111–1122, 2014.
- [6] Chawla A., Lee B., Fallon S., Jacob P.: Host Based Intrusion Detection System with Combined CNN/RNN Model. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 149–158, Springer, 2018.
- [7] Cho K., van Merriënboer B., Bahdanau D., Bengio Y.: On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv preprint arXiv:1409.1259, 2014.
- [8] Dahl G.E., Yu D., Deng L., Acero A.: Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol. 20(1), pp. 30–42, 2011.
- [9] Dua M., Aggarwal R.K., Biswas M.: Discriminative Training Using Noise Robust Integrated Features and Refined HMM Modeling, Journal of Intelligent Systems, vol. 29(1), pp. 327–344, 2018.
- [10] Dua M., Aggarwal R.K., Biswas M.: Optimizing Integrated Features for Hindi Automatic Speech Recognition System, Journal of Intelligent Systems, vol. 29(1), pp. 959–976, 2018.
- [11] Dua M., Aggarwal R.K., Biswas M.: Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling, Neural Computing and Applications, vol. 31(10), pp. 6747–6755, 2019.
- [12] Dua M., Aggarwal R.K., Biswas M.: GFCC based discriminatively trained noise robust continuous ASR system for Hindi language, Journal of Ambient Intelligence and Humanized Computing, vol. 10(6), pp. 2301–2314, 2019.
- [13] Elman J.L.: Finding structure in time, Cognitive Science, vol. 14(2), pp. 179–211, 1990.
- [14] Glorot X., Bengio Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. 2010.
- [15] Goodfellow I., Bengio Y., Courville A.: Deep learning, MIT Press, 2016.
- [16] Greff K., Srivastava R.K., Koutník J., Steunebrink B.R., Schmidhuber J.: LSTM: A Search Space Odyssey, IEEE Transactions on Neural Networks and Learning Systems, vol. 28(10), pp. 2222–2232, 2016.
- [17] Hinton G., Deng L., Yu D., Dahl G.E., Mohamed A.-r., Jaitly N., Senior A., Vanhoucke V., Nguyen P., Sainath T.N., Kingsbury B.: Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, IEEE Signal Processing Magazine, vol. 29(6), pp. 82–97, 2012.
- [18] Hinton G.E., Srivastava N., Krizhevsky A., Sutskever I., Salakhutdinov R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
- [19] Hochreiter S., Schmidhuber J.: Long Short-Term Memory, Neural Computation, vol. 9(8), pp. 1735–1780, 1997.
- [20] Hoshen Y., Weiss R.J., Wilson K.W.: Speech acoustic modeling from raw multichannel waveforms. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4624–4628. IEEE, 2015.
- [22] Ioffe S., Szegedy C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv preprint arXiv:1502.03167, 2015.
- [23] Jung J.W., Heo H.S., Yang I.H., Shim H.J., Yu H.J.: Avoiding Speaker Overfitting in End-to-End DNNs Using Raw Waveform for Text-Independent Speaker Verification, Extraction, vol. 8(12), pp. 23–24, 2018.
- [24] Kang J., Zhang W.Q., Liu W.W., Liu J., Johnson M.T.: Advanced recurrent network-based hybrid acoustic models for low resource speech recognition, EURASIP Journal on Audio, Speech, and Music Processing, vol. 2018(1), p. 6, 2018.
- [25] Kingma D.P., Ba J.: Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
- [26] Kumar A., Aggarwal R.: A Time Delay Neural Network Acoustic Modeling for Hindi Speech Recognition. In: Advances in Data and Information Sciences, pp. 425–432, Springer, 2020.
- [27] Le Q.V., Jaitly N., Hinton G.E.: A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv preprint arXiv:1504.00941, 2015.
- [28] LeCun Y., Bottou L., Bengio Y., Haffner P.: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, vol. 86(11), pp. 2278–2324, 1998.
- [29] Loweimi E., Bell P., Renals S.: On Learning Interpretable CNNs with Parametric Modulated Kernel-Based Filters. In: Interspeech, pp. 3480–3484, 2019.
- [30] Mitra S.K., Kuo Y.: Digital Signal Processing: A Computer-Based Approach, vol. 2, McGraw-Hill, New York, 2006.
- [31] Moon T., Choi H., Lee H., Song I.: RnnDrop: A novel dropout for RNNs in ASR. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 65–70. IEEE, 2015.
- [32] Noé P.G., Parcollet T., Morchid M.: CGCNN: Complex Gabor Convolutional Neural Network on raw speech. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7724–7728, IEEE, 2020.
- [33] Palaz D., Magimai-Doss M., Collobert R.: Analysis of CNN-Based Speech Recognition System Using Raw Speech as Input. Technical report, Idiap, 2015.
- [34] Parcollet T., Morchid M., Linares G.: E2E-SINCNET: Toward Fully End-To-End Speech Recognition. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7714–7718. IEEE, 2020.
- [35] Passricha V., Aggarwal R.K.: A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition, Journal of Intelligent Systems, vol. 1 (ahead-of-print), 2019.
- [36] Passricha V., Aggarwal R.K.: A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR. Journal of Ambient Intelligence and Humanized Computing, vol. 11(2), pp. 675–691, 2020.
- [37] Peddinti V., Povey D., Khudanpur S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: Sixteenth Annual Conference of the International Speech Communication Association. 2015.
- [38] Povey D., Ghoshal A., Boulianne G., Burget L., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011.
- [39] Qian Y., Bi M., Tan T., Yu K.: Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24(12), pp. 2263–2276, 2016.
- [40] Ravanelli M., Bengio Y.: Speech and Speaker Recognition From Raw Waveform with SincNet. arXiv preprint arXiv:1812.05920, 2018.
- [41] Ravanelli M., Brakel P., Omologo M., Bengio Y.: Light Gated Recurrent Units for Speech Recognition, IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2(2), pp. 92–102, 2018.
- [42] Ravanelli M., Parcollet T., Bengio Y.: The PyTorch-Kaldi Speech Recognition Toolkit. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6465–6469. IEEE, 2019.
- [43] Ravanelli M., Bengio Y.: Interpretable Convolutional Filters with SincNet. arXiv preprint arXiv:1811.09725, 2018.
- [44] Rumelhart D.E., Hinton G.E., Williams R.J.: Learning Internal Representations by Error Propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
- [45] Sainath T.N., Kingsbury B., Saon G., Soltau H., Mohamed A.-r., Dahl G., Ramabhadran B.: Deep Convolutional Neural Networks for Large-Scale Speech Tasks, Neural Networks, vol. 64, pp. 39–48, 2015.
- [46] Samudravijaya K., Rao P.V.S., Agrawal S.S.: Hindi Speech Database. In: Sixth International Conference on Spoken Language Processing (ICSLP 2000), 2000.
- [47] Schmidhuber J.: Deep Learning in Neural Networks: An Overview, Neural Networks, vol. 61, pp. 85–117, 2015.
- [48] Schuster M., Paliwal K.K.: Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, vol. 45(11), pp. 2673–2681, 1997.
- [49] Seide F., Li G., Yu D.: Conversational Speech Transcription Using Context-Dependent Deep Neural Networks. In: Twelfth Annual Conference of the International Speech Communication Association, 2011.
- [50] Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, The Journal of Machine Learning Research, vol. 15(1), pp. 1929–1958, 2014.
- [51] Stolcke A.: SRILM – an extensible language modeling toolkit. In: Seventh International Conference on Spoken Language Processing, 2002.
- [52] Tüske Z., Golik P., Schlüter R., Ney H.: Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR. In: Fifteenth Annual Conference of the International Speech Communication Association, 2014.
- [53] Xu C., Shen J., Du X., Zhang F.: An Intrusion Detection System Using a Deep Neural Network with Gated Recurrent Units, IEEE Access, vol. 6, pp. 48697–48707, 2018.
- [54] Yin W., Kann K., Yu M., Schütze H.: Comparative Study of CNN and RNN for Natural Language Processing. arXiv preprint arXiv:1702.01923, 2017.
- [55] Yu D., Deng L.: Automatic Speech Recognition, Springer, 2016.
- [56] Yu D., Li J.: Recent progresses in deep learning based acoustic models, IEEE/CAA Journal of Automatica Sinica, vol. 4(3), pp. 396–409, 2017.
- [57] Zeghidour N., Usunier N., Synnaeve G., Collobert R., Dupoux E.: End-to-End Speech Recognition From the Raw Waveform. arXiv preprint arXiv:1806.07098, 2018.
Document type
YADDA identifier
bwmeta1.element.baztech-e21f8898-7060-43f1-8705-d1ac32885f7b