A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin

Gleim, Rüdiger; Eger, Steffen; Mehler, Alexander; Uslu, Tolga; Hemati, Wahed; Lücking, Andy; Henlein, Alexander; Kahlsdorf, Sven; Hoenen, Armin

doi:10.15398/jlm.v7i1.205

Article details

Abstract

The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are then compared with old(er) methods and implementations for coarse-grained POS tagging, as well as fine-grained (morphological) POS tagging (e.g. case, number, mood). We examine to what degree recent advances in tagger development have improved accuracy – and at what cost, in terms of training and processing time. We also conduct in-domain vs. out-of-domain evaluation. Out-of-domain evaluation is particularly pertinent because the distribution of data to be tagged will typically differ from the distribution of data used to train the tagger. Pipeline tagging is then compared with a tagging approach that acknowledges dependencies between inflectional categories. Finally, we evaluate three lemmatization techniques.

Keywords

morphological tagging; lemmatization; morphologically rich languages; NLP evaluation modeling

Publisher

Instytut Podstaw Informatyki PAN

Journal

Journal of Language Modelling

Year

2019

Volume

Vol. 7, No. 1

Pages

1-52

Physical description

Bibliography: 52 items; figures, tables, charts.

Authors

author

Gleim Rüdiger

Text Technology Lab, Goethe University Frankfurt, Germany

author

Eger Steffen

Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt, Germany

author

Mehler Alexander

Text Technology Lab, Goethe University Frankfurt, Germany

author

Uslu Tolga

Text Technology Lab, Goethe University Frankfurt, Germany

author

Hemati Wahed

Text Technology Lab, Goethe University Frankfurt, Germany

author

Lücking Andy

Text Technology Lab, Goethe University Frankfurt, Germany

author

Henlein Alexander

Text Technology Lab, Goethe University Frankfurt, Germany

author

Kahlsdorf Sven

Text Technology Lab, Goethe University Frankfurt, Germany

author

Hoenen Armin

Text Technology Lab, Goethe University Frankfurt, Germany

Bibliography

[1] Bernd Bohnet and Joakim Nivre (2012), A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing, in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1455-1465, Association for Computational Linguistics, Jeju Island, Korea, http://www.aclweb.org/anthology/D12-1133.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov (2017), Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics, 5:135-146, ISSN 2307-387X.
[3] Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit (2004), TIGER: Linguistic interpretation of a German corpus, Research on Language and Computation, 2 (4):597-620, doi: 10.1007/s11168-004-7431-3.
[4] Thorsten Brants (2000), TnT: A statistical part-of-speech tagger, in Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC ’00, pp. 224-231, Association for Computational Linguistics, Stroudsburg, PA, USA, doi: 10.3115/974147.974178, http://dx.doi.org/10.3115/974147.974178.
[5] Jacob Cohen (1960), A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20:37-46.
[6] Paul Compton and Bob Jansen (1988), Knowledge in context: A strategy for expert system maintenance, Proceedings of the 2nd Australian Joint Artificial Intelligence Conference, pp. 292-306.
[7] Gregory Crane (1991), Generating and parsing classical Greek, Literary and Linguistic Computing, 6 (4):243-245, doi: 10.1093/llc/6.4.243, http://dx.doi.org/10.1093/llc/6.4.243.
[8] Hannes Dohrn and Dirk Riehle (2011), Design and implementation of the Sweble Wikitext Parser: Unlocking the structured data of Wikipedia, in Proceedings of the 7th International Symposium on Wikis and Open Collaboration, WikiSym ’11, pp. 72-81, ACM, New York, NY, USA, ISBN 978-1-4503-0909-7, doi: 10.1145/2038558.2038571, http://doi.acm.org/10.1145/2038558.2038571.
[9] Markus Dreyer, Jason Smith, and Jason Eisner (2008), Latent-variable modeling of string transductions with finite-state methods, in Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 1080-1089, Association for Computational Linguistics, Honolulu, Hawaii, http://www.aclweb.org/anthology/D08-1113.
[10] Greg Durrett and John DeNero (2013), Supervised learning of complete morphological paradigms, in Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pp. 1185-1195.
[11] Steffen Eger (2015), Designing and comparing G2P-type lemmatizers for a morphology-rich language, in Systems and Frameworks for Computational Morphology – Fourth International Workshop, SFCM 2015, Stuttgart, Germany, September 17-18, 2015, Proceedings, pp. 27-40, doi: 10.1007/978-3-319-23980-4_2.
[12] Steffen Eger, Tim vor der Brück, and Alexander Mehler (2015), Lexicon-assisted tagging and lemmatization in Latin: A comparison of six taggers and two lemmatization methods, in Proceedings of the 9th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2015), pp. 105-113, Beijing, China.
[13] Joseph L. Fleiss (1971), Measuring nominal scale agreement among many raters, Psychological Bulletin, 76:378-382.
[14] Andrea Gesmundo and Tanja Samardzic (2012), Lemmatisation as a tagging task, in The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea – Volume 2: Short Papers, pp. 368-372.
[15] Birgit Hamp and Helmut Feldweg (1997), GermaNet – a lexical-semantic net for German, in Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pp. 9-15.
[16] Dag Trygve Truslew Haug and Marius Jøhndal (2008), Creating a parallel treebank of the old Indo-European Bible translations, in Proceedings of the Sixth International Language Resources and Evaluation (LREC’08).
[17] Wahed Hemati, Tolga Uslu, and Alexander Mehler (2016), TextImager: a Distributed UIMA-based system for NLP, in Proceedings of the COLING 2016 System Demonstrations, pp. 59-63, Federated Conference on Computer Science and Information Systems.
[18] Verena Henrich and Erhard Hinrichs (2010), GernEdiT – The GermaNet editing tool, in Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), European Language Resources Association (ELRA), Valletta, Malta, ISBN 2-9517408-6-7.
[19] Mans Hulden, Markus Forsberg, and Malin Ahlberg (2014), Semi-supervised learning of morphological paradigms and lexicons, in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 569-578, Association for Computational Linguistics, Gothenburg, Sweden, doi:10.3115/v1/E14-1060, https://www.aclweb.org/anthology/E14-1060.
[20] Matjaž Juršič, Igor Mozetič, Tomaž Erjavec, and Nada Lavrač (2010), LemmaGen: Multilingual lemmatisation with induced ripple-down rules, J. UCS, 16 (9):1190-1214, http://dblp.uni-trier.de/db/journals/jucs/jucs16.html#JursicMEL10.
[21] Bernhard Jussen, Alexander Mehler, and Alexandra Ernst (2007), A corpus management system for historical semantics, Sprache und Datenverarbeitung. International Journal for Language Data Processing, 31 (1-2):81-89.
[22] Alexandros Komninos and Suresh Manandhar (2016), Dependency based embeddings for sentence classification tasks, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1490-1500, Association for Computational Linguistics, San Diego, California, doi:10.18653/v1/N16-1175, https://www.aclweb.org/anthology/N16-1175.
[23] Daniel Kondratyuk, Tomás Gavenciak, Milan Straka, and Jan Hajic (2018), LemmaTag: Jointly tagging and lemmatizing for morphologically-rich languages with BRNNs, CoRR, abs/1808.03703:4921-4928.
[24] Klaus Krippendorff (1980), Content analysis, volume 5 of The SAGE KommTexT Series, SAGE Publications, Beverly Hills and London.
[25] Matthieu Labeau, Kevin Löser, and Alexandre Allauzen (2015), Non-lexical neural architecture for fine-grained POS tagging, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 232-237, Association for Computational Linguistics, Lisbon, Portugal, http://aclweb.org/anthology/D15-1025.
[26] Omer Levy and Yoav Goldberg (2014), Dependency-based word embeddings, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 2, pp. 302-308.
[27] Andy Lücking, Armin Hoenen, and Alexander Mehler (2016), TGermaCorp – A (digital) humanities resource for (computational) linguistics, in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, pp. 4271-4277.
[28] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky (2014), The Stanford CoreNLP natural language processing toolkit, in Association for Computational Linguistics (ACL) System Demonstrations, pp. 55-60, http://www.aclweb.org/anthology/P/P14/P14-5010.
[29] Barbara McGillivray, Marco Passarotti, and Paolo Ruffolo (2009), The Index Thomisticus treebank project: Annotation, parsing and valency lexicon, TAL, 50 (2):103-127, http://atala.org/IMG/pdf/TAL-2009-50-2-04-McGillivray.pdf.
[30] Alexander Mehler, Tim vor der Brück, Rüdiger Gleim, and Tim Geelhaar (2015), Towards a network model of the coreness of texts: An experiment in classifying Latin texts using the TTLab Latin tagger, in Chris Biemann and Alexander Mehler, editors, Text Mining: From Ontology Learning to Automated Text Processing Applications, Theory and Applications of Natural Language Processing, pp. 87-112, Springer, Berlin/New York.
[31] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean (2013), Distributed representations of words and phrases and their compositionality, in Advances in Neural Information Processing Systems 26, pp. 3111-3119, Curran Associates, Inc., http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
[32] Stefano Minozzi (2008), La costruzione di una base di conoscenza lessicale per la lingua latina: LatinWordnet, http://hdl.handle.net/11562/324939.
[33] Thomas Müller, Ryan Cotterell, Alexander M. Fraser, and Hinrich Schütze (2015), Joint lemmatization and morphological tagging with Lemming, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 2268-2274.
[34] Thomas Müller, Helmut Schmid, and Hinrich Schütze (2013), Efficient higher-order CRFs for morphological tagging, in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 322-332, Association for Computational Linguistics, Seattle, Washington, USA, http://www.aclweb.org/anthology/D13-1032.
[35] Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham (2014), RDRPOSTagger: A ripple down rules-based part-of-speech tagger, in Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 17-20, Association for Computational Linguistics, Gothenburg, Sweden, http://www.aclweb.org/anthology/E14-2005.
[36] Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham (2016), A robust transformation-based learning approach using ripple down rules for part-of-speech tagging, AI Communications, 29 (3):409-422, http://dx.doi.org/10.3233/AIC-150698.
[37] Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak (2015), Inflection generation as discriminative string transduction, in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 922-931, Association for Computational Linguistics, Denver, Colorado, http://www.aclweb.org/anthology/N15-1093.
[38] Marco Passarotti (2015), What you can do with linguistically annotated data. From the Index Thomisticus to the Index Thomisticus Treebank, in Reading Sacred Scripture with Thomas Aquinas. Hermeneutical Tools, Theological Questions and New Perspectives, pp. 3-44, Brepols.
[39] Marco Passarotti, Marco Budassi, Eleonora Litta, and Paolo Ruffolo (2017), The Lemlat 3.0 package for morphological analysis of Latin, in Proceedings of the NoDaLiDa 2017 workshop on processing historical language, 133, pp. 24-31, Linköping University Electronic Press, Linköpings universitet, ISSN 1650-3740.
[40] Jeffrey Pennington, Richard Socher, and Christopher D. Manning (2014), GloVe: Global vectors for word representation, in Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, http://www.aclweb.org/anthology/D14-1162.
[41] Adwait Ratnaparkhi (1996), A maximum entropy model for part-of-speech tagging, in Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, Pennsylvania.
[42] Toni Rietveld and Roeland van Hout (1993), Statistical techniques for the study of language and language behaviour, Mouton de Gruyter, Amsterdam.
[43] Anne Schiller, Simone Teufel, Christine Stöckert, and Christine Thielen (1999), Guidelines für das Tagging deutscher Textcorpora mit STTS (kleines und großes Tagset), Technical report, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
[44] Helmut Schmid (1994), Probabilistic part-of-speech tagging using decision trees, in International Conference on New Methods in Language Processing, pp. 44-49, Manchester, UK.
[45] Tobias Schnabel and Hinrich Schütze (2014), FLORS: Fast and simple domain adaptation for part-of-speech tagging, TACL, 2:15-26, https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/183.
[46] Carsten Schnober, Steffen Eger, Erik-Lân Do Dinh, and Iryna Gurevych (2016), Still not there? Comparing traditional sequence-to-sequence models to encoder-decoder neural networks on monotone string translation tasks, in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1703-1714, The COLING 2016 Organizing Committee, Osaka, Japan.
[47] Uwe Springmann, Helmut Schmid, and Dietmar Najock (2016), LatMor: A Latin finite-state morphology encoding vowel quantity, Open Linguistics, 2 (1):386-392, doi: 10.1515/opli-2016-0019.
[48] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer (2003), Feature-rich part-of-speech tagging with a cyclic dependency network, in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, NAACL ’03, pp. 173-180, Association for Computational Linguistics, Stroudsburg, PA, USA, doi: 10.3115/1073445.1073478, http://dx.doi.org/10.3115/1073445.1073478.
[49] Yoshimasa Tsuruoka, Yusuke Miyao, and Jun’ichi Kazama (2011), Learning with lookahead: Can history-based models rival globally optimized models?, in Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL ’11, pp. 238-246, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-932432-92-3, http://dl.acm.org/citation.cfm?id=2018936.2018964.
[50] Tim vor der Brück and Alexander Mehler (2016), TLT-CRF: A lexicon-supported morphological tagger for Latin based on conditional random fields, in Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the tenth international conference on language resources and evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France, ISBN 978-2-9517408-9-1.
[51] Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao (2015), Part-of-speech tagging with bidirectional long short-term memory recurrent neural network, CoRR, abs/1510.06168, http://arxiv.org/abs/1510.06168.
[52] Wenpeng Yin, Tobias Schnabel, and Hinrich Schütze (2015), Online updating of word representations for part-of-speech tagging, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1329-1334, Association for Computational Linguistics, Lisbon, Portugal, http://aclweb.org/anthology/D15-1155.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-2dd05fba-6bc3-46f7-aaad-826f32f78f15