Learning cross-lingual phonological and orthographic adaptations : a case study in improving neural machine translation between low-resource languages

Jha, Saurav; Sudhakar, Akhilesh; Singh, Anil Kumar

doi:10.15398/jlm.v7i2.214

Abstract

Out-of-vocabulary (OOV) words can pose serious challenges for machine translation (MT) tasks, in particular for low-resource language (LRL) pairs, i.e., language pairs for which few or no parallel corpora exist. Our work adapts variants of seq2seq models to perform transduction of such words from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs built from a bilingual dictionary of Hindi-Bhojpuri words. We demonstrate that our models can be used effectively for language pairs that have limited parallel corpora; our models work at the character level to grasp phonetic and orthographic similarities across multiple types of word adaptations, whether synchronic or diachronic, loan words or cognates. We describe the training aspects of several character-level NMT systems that we adapted to this task and characterize their typical errors. Our method improves the BLEU score by 6.3 on the Hindi-to-Bhojpuri translation task. Further, we show that such transductions can generalize well to other languages by applying them successfully to Hindi-Bangla cognate pairs. Our work can be seen as an important step in the process of: (i) resolving the OOV words problem arising in MT tasks; (ii) creating effective parallel corpora for resource-constrained languages; and (iii) leveraging the enhanced semantic knowledge captured by word-level embeddings to perform character-level tasks.
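As a rough illustration of what working "at the character level" on word pairs can involve, the sketch below turns a list of word pairs into integer-encoded character sequences of the kind a seq2seq model typically consumes. It is not the authors' pipeline: the function names, the boundary markers `<s>` and `</s>`, and the placeholder pairs are all assumptions made for this example, and the placeholder strings are not real Hindi-Bhojpuri cognates.

```python
from typing import Dict, List, Tuple

# Assumed boundary markers; the record does not say which ones the paper uses.
START, END = "<s>", "</s>"


def to_char_tokens(word: str) -> List[str]:
    """Split a word into code points and wrap it in boundary markers."""
    return [START] + list(word) + [END]


def build_vocab(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Assign an integer id to every symbol seen on either side of the pairs."""
    symbols = {START, END}
    for src, tgt in pairs:
        symbols.update(src)  # adds each character of the source word
        symbols.update(tgt)  # adds each character of the target word
    return {sym: idx for idx, sym in enumerate(sorted(symbols))}


def encode_pairs(
    pairs: List[Tuple[str, str]], vocab: Dict[str, int]
) -> List[Tuple[List[int], List[int]]]:
    """Encode each (source, target) word pair as two lists of symbol ids."""
    return [
        (
            [vocab[t] for t in to_char_tokens(src)],
            [vocab[t] for t in to_char_tokens(tgt)],
        )
        for src, tgt in pairs
    ]


if __name__ == "__main__":
    # Hypothetical placeholder pairs, not taken from any dictionary.
    demo_pairs = [("abc", "abd"), ("ca", "da")]
    demo_vocab = build_vocab(demo_pairs)
    for src_ids, tgt_ids in encode_pairs(demo_pairs, demo_vocab):
        print(src_ids, "->", tgt_ids)
```

Note that `list(word)` splits on Unicode code points, so in Devanagari text a vowel sign becomes its own token, separate from its consonant; whether that granularity matches the paper's setup is not stated in this record.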

Keywords

neural machine translation; Hindi; Bhojpuri; word transduction; low resource language; attention model

Publisher

Instytut Podstaw Informatyki PAN

Journal

Journal of Language Modelling

Year

2019

Volume

Vol. 7, No. 2

Pages

101-142

Physical description

Bibliography: 62 items, figures, tables.

Authors

author

Jha Saurav

MNNIT Allahabad, Prayagraj, India

author

Sudhakar Akhilesh

IIT (BHU), Varanasi, India

author

Singh Anil Kumar

IIT (BHU), Varanasi, India

Notes

Record prepared with funds from MNiSW, agreement No. 461252, under the programme "Social Responsibility of Science", module: Popularisation of science and promotion of sport (2020).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-26438327-bdf9-4ba6-b8f9-26474c0c0f17