Design and analysis of a lean interface for Sanskrit corpus annotation

Goyal, P.; Huet, G.

doi:10.15398/jlm.v4i2.108

Content

Full text:

Goyal_Design and analysis of a lean interface_2_2016.pdf

Abstract

We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used, and aligning with the input sentence. We show that this representation provides an exponential saving, in both space and time. The segmentation methodology is lexicon-directed. When the lexicon does not have full coverage of the corpus vocabulary, some chunks of the input may fail to be recognized. We designed a lexicon acquisition facility, which remedies this incompleteness and makes the interface more robust. This interface has been implemented, and is currently being applied to the annotation of the Sanskrit Library corpus. Evaluation over 1,500 sentences from the Pañcatantra text shows the effectiveness of the proposed interface on real corpus data.

Keywords

Sanskrit text segmentation annotation interface

Publisher

Instytut Podstaw Informatyki PAN

Journal

Journal of Language Modelling

Year

2016

Volume

Vol. 4, No. 2

Pages

145-182

Physical description

Bibliography: 35 items, figures, tables, charts.

Authors

author

Goyal P.

IIT Kharagpur, India

author

Huet G.

INRIA Paris Laboratory, France

Bibliography

[1] Kenneth R. Beesley and Lauri Karttunen (2003), Finite-state morphology: Xerox tools and techniques, CSLI Publications, The University of Chicago Press.
[2] Sylvie Billot and Bernard Lang (1989), The structure of shared forests in ambiguous parsing, in Proceedings of the 27th annual meeting on Association for Computational Linguistics, ACL ’89, pp. 143-151, Association for Computational Linguistics, Stroudsburg, PA, USA, doi: 10.3115/981623.981641.
[3] Keh-Jiann Chen and Shing-Huan Liu (1992), Word identification for Mandarin Chinese sentences, in Proceedings of the 14th conference on Computational linguistics-Volume 1, pp. 101-107, Association for Computational Linguistics.
[4] Jay Earley (1983), An efficient context-free parsing algorithm (reprint), Communications of the ACM - Special 25th Anniversary Issue, 26 (1): 57-61.
[5] Pawan Goyal, Vipul Arora, and Laxmidhar Behera (2009), Analysis of Sanskrit text: Parsing and semantic relations, in Gérard Huet, Amba Kulkarni, and Peter Scharf, editors, Sanskrit Computational Linguistics 1 & 2, pp. 200-218, Springer-Verlag LNAI 5402.
[6] Pawan Goyal and Gérard Huet (2013), Completeness analysis of a Sanskrit reader, in Malhar Kulkarni, editor, Recent Researches in Sanskrit Computational Linguistics (Proceedings, 5th International Symposium on Sanskrit Computational Linguistics), pp. 130-171, D.K. Printworld.
[7] Pawan Goyal, Gérard Huet, Amba Kulkarni, Peter Scharf, and Ralph Bunker (2012), A distributed platform for Sanskrit processing, in COLING, pp. 1011-1028.
[8] Oliver Hellwig (2009), SanskritTagger, a stochastic lexical and POS tagger for Sanskrit, in Gérard Huet, Amba Kulkarni, and Peter Scharf, editors, Sanskrit Computational Linguistics 1 & 2, pp. 266-277, Springer-Verlag LNAI 5402.
[9] Gérard Huet (2005), A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger, Journal of Functional Programming, 15, 4: 573-614, http://yquem.inria.fr/~huet/PUBLIC/tagger.pdf.
[10] Gérard Huet (2006), Lexicon-directed segmentation and tagging of Sanskrit, in Bertil Tikkanen and Heinrich Hettrich, editors, Themes and Tasks in Old and Middle Indo-Aryan Linguistics, pp. 307-325, Motilal Banarsidass.
[11] Gérard Huet (2007), Shallow syntax analysis in Sanskrit guided by semantic nets constraints, in Proceedings of the 2006 International Workshop on Research Issues in Digital Libraries, pp. 6:1-6:10, ACM, New York, NY, USA, doi: http://doi.acm.org/10.1145/1364742.1364750, http://yquem.inria.fr/~huet/PUBLIC/IWRIDL.pdf.
[12] Gérard Huet (2009), Formal structure of Sanskrit text: Requirements analysis for a mechanical Sanskrit processor, in Gérard Huet, Amba Kulkarni, and Peter Scharf, editors, Sanskrit Computational Linguistics. First and Second International Symposia Rocquencourt, France, October 29-31, 2007 Providence, RI, USA, May 15-17, 2008, pp. 162-199, Springer.
[13] Gérard Huet and Pawan Goyal (2013), Design of a lean interface for Sanskrit corpus annotation, in Proceedings of ICON 2013, the 10th International Conference on NLP, pp. 177-186.
[14] Gérard Huet and Amba Kulkarni (2014), Sanskrit linguistics web services, in COLING (Demo), pp. 48-51.
[15] Gérard Huet, Amba Kulkarni, and Peter Scharf, editors (2009), Sanskrit computational linguistics 1 & 2, Springer-Verlag LNAI 5402.
[16] Gérard Huet and Benoît Razet (2015), Computing with relational machines, Mathematical Structures in Computer Science, FirstView: 1-20, ISSN 1469-8072, doi: 10.1017/S0960129515000390, http://journals.cambridge.org/article_S0960129515000390.
[17] Girish Nath Jha, editor (2010), Sanskrit computational linguistics 4, Springer-Verlag LNAI 6465.
[18] Ronald M. Kaplan and Martin Kay (1994), Regular models of phonological rule systems, Computational Linguistics, 20, 3: 331-378.
[19] Amba Kulkarni and Gérard Huet, editors (2009), Sanskrit computational linguistics 3, Springer-Verlag LNAI 5406.
[20] Amba Kulkarni, Sheetal Pokar, and Devanand Shukl (2010), Designing a constraint based parser for Sanskrit, in Girish N. Jha, editor, Proceedings of the 4th International Sanskrit Computational Linguistics Symposium, pp. 70-90, Springer-Verlag LNAI 6465.
[21] Amba Kulkarni and K. V. Ramakrishnamacharyulu (2013), Parsing Sanskrit texts: Some relation specific issues, in Malhar Kulkarni, editor, Proceedings of the 5th International Sanskrit Computational Linguistics Symposium, pp. 191-212, D. K. Printworld(P) Ltd.
[22] Amba Kulkarni and Devanand Shukl (2009), Sanskrit morphological analyser: Some issues, Indian Linguistics, 70 (1-4): 169-177.
[23] Anil Kumar, Vipul Mittal, and Amba Kulkarni (2010), Sanskrit compound processor, in Girish N. Jha, editor, Proceedings of the 4th International Sanskrit Computational Linguistics Symposium, pp. 57-69, Springer-Verlag LNAI 6465.
[24] Monier Monier-Williams, Ernst Leumann, and Carl Cappeller (1899), A Sanskrit-English Dictionary: Etymological And philologically arranged with special reference to cognate Indo-European languages, Oxford, The Clarendon Press, http://www.sanskrit-lexicon.uni-koeln.de/scans/csldoc/dictionaries/mw.html.
[25] Emmanuel Roche and Yves Schabes (1997), Finite-State Language Processing, MIT Press.
[26] Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola (2010), On dual decomposition and linear programming relaxations for natural language processing, in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1-11, Association for Computational Linguistics.
[27] Peter Scharf, Anuja Ajotikar, Sampada Savardekar, and Pawan Goyal (2015), Distinctive features of poetic syntax preliminary results, Sanskrit syntax, pp. 305-324.
[28] Peter Scharf and Malcolm Hyman (2009), Linguistic issues in encoding Sanskrit, Motilal Banarsidass.
[29] Andreas Stolcke (1995), An efficient probabilistic context-free parsing algorithm that computes prefix probabilities, Computational Linguistics, 21 (2): 165-201.
[30] Weiwei Sun (2010), Word-based and character-based word segmentation models: Comparison and combination, in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 1211-1219, Association for Computational Linguistics.
[31] Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Jun’ichi Tsujii (2009), A discriminative latent variable Chinese segmenter with hybrid word/character information, in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 56-64, Association for Computational Linguistics.
[32] Masaru Tomita (1985), Efficient parsing for natural language: A fast algorithm for practical systems, The Springer International Series in Engineering and Computer Science - Volume 8, Springer.
[33] Huihsin Tseng (2005), A conditional random field word segmenter, in Fourth SIGHAN Workshop on Chinese Language Processing.
[34] Mengqiu Wang, Rob Voigt, and Christopher D. Manning (2014), Two knives cut better than one: Chinese word segmentation with dual decomposition, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, MD.
[35] Yue Zhang and Stephen Clark (2007), Chinese segmentation with a word-based perceptron algorithm, in Annual Meeting of the Association for Computational Linguistics, volume 45, p. 840.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-965748c9-9df1-4d97-99ca-76ba809682da