Article title

Data-oriented parsing with discontinuous constituents and function tags

Publication languages
EN
Abstracts
EN
Statistical parsers are effective but are typically limited to producing projective dependencies or constituents. On the other hand, linguistically rich parsers recognize non-local relations and analyze both form and function phenomena but rely on extensive manual grammar engineering. We combine advantages of the two by building a statistical parser that produces richer analyses. We investigate new techniques to implement treebank-based parsers that allow for discontinuous constituents. We present two systems. One system is based on a Linear Context-Free Rewriting System (LCFRS), while using a Probabilistic Discontinuous Tree-Substitution Grammar (PDTSG) to improve disambiguation performance. Another system encodes discontinuities in the labels of phrase-structure trees, allowing for efficient context-free grammar parsing. The two systems demonstrate that tree fragments as used in tree-substitution grammar improve disambiguation performance while capturing non-local relations on an as-needed basis. Additionally, we present results for models that produce function tags, resulting in a more linguistically adequate model of the data. We report substantial accuracy improvements in discontinuous parsing for German, English, and Dutch, including results on spoken Dutch.
Year
Pages
57-111
Physical description
Bibliography: 102 items, figures, tables, charts.
Authors
  • Huygens ING, Royal Netherlands Academy of Arts and Sciences
  • Institute for Logic, Language and Computation, University of Amsterdam
author
  • Institute for Logic, Language and Computation, University of Amsterdam
author
  • Institute for Logic, Language and Computation, University of Amsterdam
Bibliography
  • [1] Krasimir Angelov and Peter Ljunglöf (2014), Fast statistical parsing with parallel multiple context-free grammars, in Proceedings of EACL, pp. 368-376, http://aclweb.org/anthology/E14-1039.
  • [2] Mohit Bansal and Dan Klein (2010), Simple, accurate parsing with an all-fragments grammar, in Proceedings of ACL, pp. 1098-1107, http://aclweb.org/anthology/P10-1112.
  • [3] François Barthélemy, Pierre Boullier, Philippe Deschamp, and Éric de la Clergerie (2001), Guided parsing of range concatenation languages, in Proceedings of ACL, pp. 42-49, http://aclweb.org/anthology/P01-1007.
  • [4] Shane Bergsma, Matt Post, and David Yarowsky (2012), Stylometric analysis of scientific articles, in Proceedings of NAACL, pp. 327-337, http://aclweb.org/anthology/N12-1033.
  • [5] Ezra Black, John Lafferty, and Salim Roukos (1992), Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals, in Proceedings of ACL, pp. 185-192, http://aclweb.org/anthology/P92-1024.
  • [6] Don Blaheta and Eugene Charniak (2000), Assigning function tags to parsed text, in Proceedings of NAACL, pp. 234-240, http://aclweb.org/anthology/A00-2031.
  • [7] Rens Bod (1992), A computational model of language performance: data-oriented parsing, in Proceedings of COLING, pp. 855-859, http://aclweb.org/anthology/C92-3126.
  • [8] Rens Bod (1995), The problem of computing the most probable tree in data-oriented parsing and stochastic tree grammars, in Proceedings of EACL, pp. 104-111, http://aclweb.org/anthology/E95-1015.
  • [9] Rens Bod (2001), What is the minimal set of fragments that achieves maximal parse accuracy?, in Proceedings of ACL, pp. 69-76, http://aclweb.org/anthology/P01-1010.
  • [10] Rens Bod, Remko Scha, and Khalil Sima’an, editors (2003), Data-Oriented Parsing, The University of Chicago Press.
  • [11] Pierre Boullier (1998), Proposal for a natural language processing syntactic backbone, Technical Report RR-3342, INRIA-Rocquencourt, Le Chesnay, France, http://www.inria.fr/RRRT/RR-3342.html.
  • [12] Adriane Boyd (2007), Discontinuity revisited: An improved conversion to context-free representations, in Proceedings of the Linguistic Annotation Workshop, pp. 41-44, http://aclweb.org/anthology/W07-1506.
  • [13] Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith (2002), The Tiger treebank, in Proceedings of the workshop on treebanks and linguistic theories, pp. 24-41, http://www.bultreebank.org/proceedings/paper03.pdf.
  • [14] Joan Bresnan, Ronald M. Kaplan, Stanley Peters, and Annie Zaenen (1982), Cross-serial dependencies in Dutch, Linguistic Inquiry, 13 (4): 613-635.
  • [15] Shu Cai, David Chiang, and Yoav Goldberg (2011), Language-independent parsing with empty elements, in Proceedings of ACL-HLT, pp. 212-216, http://aclweb.org/anthology/P11-2037.
  • [16] Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin (2015), Better bitmap performance with Roaring bitmaps, Software: Practice and Experience, ISSN 1097-024X, doi: 10.1002/spe.2325, http://arxiv.org/abs/1402.6407, to appear.
  • [17] Eugene Charniak (1996), Tree-bank grammars, in Proceedings of the National Conference on Artificial Intelligence, pp. 1031-1036.
  • [18] Eugene Charniak, Mark Johnson, M. Elsner, J. Austerweil, D. Ellis, I. Haxton, C. Hill, R. Shrivaths, J. Moore, M. Pozar, et al. (2006), Multilevel coarse-to-fine PCFG parsing, in Proceedings of NAACL-HLT, pp. 168-175, http://aclweb.org/anthology/N06-1022.
  • [19] David Chiang (2000), Statistical parsing with an automatically-extracted tree adjoining grammar, in Proceedings of ACL, pp. 456-463, http://aclweb.org/anthology/P00-1058.
  • [20] Noam Chomsky (1956), Three models for the description of language, IRE Transactions on Information Theory, 2 (3): 113-124.
  • [21] Noam Chomsky (1965), Aspects of the Theory of Syntax, MIT press.
  • [22] Trevor Cohn, Phil Blunsom, and Sharon Goldwater (2010), Inducing tree-substitution grammars, The Journal of Machine Learning Research, 11(Nov): 3053-3096.
  • [23] Trevor Cohn, Sharon Goldwater, and Phil Blunsom (2009), Inducing compact but accurate tree-substitution grammars, in Proceedings of NAACL-HLT, pp. 548-556, http://aclweb.org/anthology/N09-1062.
  • [24] Michael Collins (1999), Head-driven statistical models for natural language parsing, Ph.D. thesis, University of Pennsylvania.
  • [25] Peter Dienes and Amit Dubey (2003), Deep syntactic processing by combining shallow methods, in Proceedings of ACL, pp. 431-438, http://aclweb.org/anthology/P03-1055.
  • [26] Amit Dubey and Frank Keller (2003), Probabilistic parsing for German using sister-head dependencies, in Proceedings of ACL, pp. 96-103, http://aclweb.org/anthology/P03-1013.
  • [27] Kilian Evang and Laura Kallmeyer (2011), PLCFRS parsing of English discontinuous constituents, in Proceedings of IWPT, pp. 104-116, http://aclweb.org/anthology/W11-2913.
  • [28] Daniel Fernández-González and André F. T. Martins (2015), Parsing as reduction, in Proceedings of ACL, pp. 1523-1533, http://aclweb.org/anthology/P15-1147.
  • [29] Alexander Fraser, Helmut Schmid, Richárd Farkas, Renjing Wang, and Hinrich Schütze (2013), Knowledge sources for constituent parsing of German, a morphologically rich and less-configurational language, Computational Linguistics, 39 (1): 57-85, http://aclweb.org/anthology/J13-1005.
  • [30] Ryan Gabbard, Mitchell Marcus, and Seth Kulick (2006), Fully parsing the Penn treebank, in Proceedings of NAACL-HLT, pp. 184-191, http://aclweb.org/anthology/N06-1024.
  • [31] Stuart Geman and Mark Johnson (2004), Probability and statistics in computational linguistics, a brief review, in Mark Johnson, Sanjeev P. Khudanpur, Mari Ostendorf, and Roni Rosenfeld, editors, Mathematical foundations of speech and language processing, pp. 1-26, Springer.
  • [32] Daniel Gildea (2010), Optimal parsing strategies for linear context-free rewriting systems, in Proceedings of NAACL-HLT, pp. 769-776, http://aclweb.org/anthology/N10-1118.
  • [33] Joshua Goodman (2003), Efficient parsing of DOP with PCFG-reductions, in Bod et al. (2003), pp. 125-146.
  • [34] Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning (2011), Multiword expression identification with tree substitution grammars: A parsing tour de force with French, in Proceedings of EMNLP, pp. 725-735, http://aclweb.org/anthology/D11-1067.
  • [35] Johan Hall and Joakim Nivre (2008), Parsing discontinuous phrase structure with grammatical functions, in Bengt Nordström and Aarne Ranta, editors, Advances in Natural Language Processing, volume 5221 of Lecture Notes in Computer Science, pp. 169-180, Springer, http://dx.doi.org/10.1007/978-3-540-85287-2_17.
  • [36] Lars Hoogweg (2003), Extending DOP with insertion, in Bod et al. (2003), pp. 317-335.
  • [37] Yu-Yin Hsu (2010), Comparing conversions of discontinuity in PCFG parsing, in Proceedings of Treebanks and Linguistic Theories, pp. 103-113, http://hdl.handle.net/10062/15954.
  • [38] Liang Huang and David Chiang (2005), Better k-best parsing, in Proceedings of IWPT, pp. 53-64, NB corrected version on author homepage: http://www.cis.upenn.edu/~lhuang3/huang-iwpt-correct.pdf.
  • [39] Marinus A. C. Huybregts (1976), Overlapping dependencies in Dutch, Utrecht Working Papers in Linguistics, 1: 24-65.
  • [40] Mark Johnson (2002), A simple pattern-matching algorithm for recovering empty nodes and their antecedents, in Proceedings of ACL, pp. 136-143, http://aclweb.org/anthology/P02-1018.
  • [41] Aravind K. Joshi (1985), How much context sensitivity is necessary for characterizing structural descriptions: Tree adjoining grammars, in David R. Dowty, Lauri Karttunen, and Arnold M. Zwicky, editors, Natural language parsing: Psychological, computational and theoretical perspectives, pp. 206-250, Cambridge University Press, New York.
  • [42] Miriam Kaeshammer and Vera Demberg (2012), German and English treebanks and lexica for tree-adjoining grammars, in Proceedings of LREC, pp. 1880-1887, http://www.lrec-conf.org/proceedings/lrec2012/pdf/398_Paper.pdf.
  • [43] Laura Kallmeyer (2009), A declarative characterization of different types of multicomponent tree adjoining grammars, Research on Language and Computation, 7 (1): 55-99.
  • [44] Laura Kallmeyer (2010), Parsing Beyond Context-Free Grammars, Cognitive Technologies, Springer.
  • [45] Laura Kallmeyer and Wolfgang Maier (2010), Data-driven parsing with probabilistic linear context-free rewriting systems, in Proceedings of COLING, pp. 537-545, http://aclweb.org/anthology/C10-1061.
  • [46] Laura Kallmeyer and Wolfgang Maier (2013), Data-driven parsing using probabilistic linear context-free rewriting systems, Computational Linguistics, 39 (1): 87-119, http://aclweb.org/anthology/J13-1006.
  • [47] Laura Kallmeyer, Wolfgang Maier, and Giorgio Satta (2009), Synchronous rewriting in treebanks, in Proceedings of IWPT, http://aclweb.org/anthology/W09-3810.
  • [48] Fred Karlsson (2007), Constraints on multiple centre-embedding of clauses, Journal of Linguistics, 43 (2): 365-392.
  • [49] Dan Klein and Christopher D. Manning (2003), Accurate unlexicalized parsing, in Proceedings of ACL, volume 1, pp. 423-430, http://aclweb.org/anthology/P03-1054.
  • [50] Marco Kuhlmann (2013), Mildly non-projective dependency grammar, Computational Linguistics, 39 (2): 355-387, http://aclweb.org/anthology/J13-2004.
  • [51] Marco Kuhlmann and Giorgio Satta (2009), Treebank grammar techniques for non-projective dependency parsing, in Proceedings of EACL, pp. 478-486, http://aclweb.org/anthology/E09-1055.
  • [52] Roger Levy (2005), Probabilistic models of word order and syntactic discontinuity, Ph.D. thesis, Stanford University.
  • [53] Roger Levy and Christopher D. Manning (2004), Deep dependencies from context-free statistical parsers: correcting the surface dependency approximation, in Proceedings of ACL, pp. 327-334, http://aclweb.org/anthology/P04-1042.
  • [54] Wolfgang Maier, Miriam Kaeshammer, Peter Baumann, and Sandra Kübler (2014), Discosuite – A parser test suite for German discontinuous structures, in Proceedings of LREC, http://www.lrec-conf.org/proceedings/lrec2014/pdf/230_Paper.pdf.
  • [55] Wolfgang Maier, Miriam Kaeshammer, and Laura Kallmeyer (2012), PLCFRS parsing revisited: Restricting the fan-out to two, in Proceedings of TAG, volume 11, http://wolfgang-maier.net/pub/tagplus12.pdf.
  • [56] Wolfgang Maier and Timm Lichte (2011), Characterizing discontinuity in constituent treebanks, in Proceedings of Formal Grammar 2009, pp. 167-182, Springer.
  • [57] Wolfgang Maier and Anders Søgaard (2008), Treebanks and mild context-sensitivity, in Proceedings of Formal Grammar 2008, pp. 61-76.
  • [58] James D. McCawley (1982), Parentheticals and discontinuous constituent structure, Linguistic Inquiry, 13 (1): 91-106, http://www.jstor.org/stable/4178261.
  • [59] Mark-Jan Nederhof and Heiko Vogler (2014), Hybrid grammars for discontinuous parsing, in Proceedings of COLING, pp. 1370-1381, http://aclweb.org/anthology/C14-1130.
  • [60] Timothy J. O’Donnell, Joshua B. Tenenbaum, and Noah D. Goodman (2009), Fragment grammars: Exploring computation and reuse in language, Technical Report MIT-CSAIL-TR-2009-013, MIT CSAIL, http://hdl.handle.net/1721.1/44963.
  • [61] Almerindo E. Ojeda (1988), A linear precedence account of cross-serial dependencies, Linguistics and Philosophy, 11 (4): 457-492.
  • [62] Adam Pauls and Dan Klein (2009), Hierarchical search for parsing, in Proceedings of NAACL-HLT, pp. 557-565, http://aclweb.org/anthology/N09-1063.
  • [63] P. Stanley Peters and R. W. Ritchie (1973), On the generative power of transformational grammars, Information Sciences, 6: 49-83, http://dx.doi.org/10.1016/0020-0255(73)90027-3.
  • [64] Slav Petrov (2010), Products of random latent variable grammars, in Proceedings of NAACL-HLT, pp. 19-27, http://aclweb.org/anthology/N10-1003.
  • [65] Kenneth L. Pike (1943), Taxemes and immediate constituents, Language, 19 (2): 65-82, http://www.jstor.org/stable/409840.
  • [66] Matt Post (2011), Judging grammaticality with tree substitution grammar derivations, in Proceedings of the ACL-HLT 2011, pp. 217-222, http://aclweb.org/anthology/P11-2038.
  • [67] Matt Post and Daniel Gildea (2009), Bayesian learning of a tree substitution grammar, in Proceedings of the ACL-IJCNLP 2009 Conference, Short Papers, pp. 45-48, http://aclweb.org/anthology/P09-2012.
  • [68] Brian Roark, Kristy Hollingshead, and Nathan Bodenstab (2012), Finite-state chart constraints for reduced complexity context-free parsing pipelines, Computational Linguistics, 38 (4): 719-753, http://aclweb.org/anthology/J12-4002.
  • [69] Federico Sangati and Willem Zuidema (2011), Accurate parsing with compact tree-substitution grammars: Double-DOP, in Proceedings of EMNLP, pp. 84-95, http://aclweb.org/anthology/D11-1008.
  • [70] Federico Sangati, Willem Zuidema, and Rens Bod (2010), Efficiently extract recurring tree fragments from large treebanks, in Proceedings of LREC, pp. 219-226, http://dare.uva.nl/record/371504.
  • [71] Remko Scha (1990), Language theory and language technology; competence and performance, in Q. A. M. de Kort and G. L. J. Leerdam, editors, Computertoepassingen in de Neerlandistiek, pp. 7-22, LVVN, Almere, the Netherlands, original title: Taaltheorie en taaltechnologie; competence en performance. English translation: http://iaaa.nl/rs/LeerdamE.html.
  • [72] Yves Schabes and Richard C. Waters (1995), Tree insertion grammar: a cubic-time, parsable formalism that lexicalizes context-free grammar without changing the trees produced, Computational Linguistics, 21 (4): 479-513, http://aclweb.org/anthology/J95-4002.
  • [73] Helmut Schmid (2004), Efficient parsing of highly ambiguous context-free grammars with bit vectors, in Proceedings of COLING ’04, http://aclweb.org/anthology/C04-1024.
  • [74] Helmut Schmid (2006), Trace prediction and recovery with unlexicalized PCFGs and slash features, in Proceedings of COLING-ACL, pp. 177-184, http://aclweb.org/anthology/P06-1023.
  • [75] William Schuler, Samir AbdelRahman, Tim Miller, and Lane Schwartz (2010), Broad-coverage parsing using human-like memory constraints, Computational Linguistics, 36 (1): 1-30, http://aclweb.org/anthology/J10-1001.
  • [76] William Schuler, David Chiang, and Mark Dras (2000), Multi-component TAG and notions of formal power, in Proceedings of ACL, pp. 448-455, http://aclweb.org/anthology/P00-1057.
  • [77] Hiroyuki Seki, Takahashi Matsumura, Mamoru Fujii, and Tadao Kasami (1991), On multiple context-free grammars, Theoretical Computer Science, 88 (2): 191-229.
  • [78] Stuart M. Shieber (1985), Evidence against the context-freeness of natural language, Linguistics and Philosophy, 8: 333-343.
  • [79] Hiroyuki Shindo, Yusuke Miyao, Akinori Fujino, and Masaaki Nagata (2012), Bayesian symbol-refined tree substitution grammars for syntactic parsing, in Proceedings of ACL, pp. 440-448, http://aclweb.org/anthology/P12-1046.
  • [80] Khalil Sima’an (1997), Efficient Disambiguation by means of stochastic tree substitution grammars, in D. Jones and H. Somers, editors, New Methods in Language Processing, pp. 178-198, UCL Press, UK.
  • [81] Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit (1997), An annotation scheme for free word order languages, in Proceedings of ANLP, pp. 88-95, http://aclweb.org/anthology/A97-1014.
  • [82] Ben Swanson, Elif Yamangil, Eugene Charniak, and Stuart Shieber (2013), A context free TAG variant, in Proceedings of the ACL, pp. 302-310, http://aclweb.org/anthology/P13-1030.
  • [83] Benjamin Swanson and Eugene Charniak (2012), Native language detection with tree substitution grammars, in Proceedings of ACL, pp. 193-197, http://aclweb.org/anthology/P12-2038.
  • [84] Heike Telljohann, Erhard Hinrichs, and Sandra Kübler (2004), The Tüba-D/Z Treebank: Annotating German with a context-free backbone, in Proceedings of LREC, pp. 2229-2235, http://www.lrec-conf.org/proceedings/lrec2004/pdf/135.pdf.
  • [85] Heike Telljohann, Erhard W Hinrichs, Sandra Kübler, Heike Zinsmeister, and Kathrin Beck (2012), Stylebook for the Tübingen treebank of written German (TüBa-D/Z), technical report, Seminar für Sprachwissenschaft, Universität Tübingen, Germany, http://www.sfs.uni-tuebingen.de/fileadmin/static/ascl/resources/tuebadz-stylebook-1201.pdf.
  • [86] Marten H. Trautwein (1995), Computational pitfalls in tractable grammar formalisms, Ph.D. thesis, University of Amsterdam, http://www.illc.uva.nl/Research/Publications/Dissertations/DS-1995-15.text.ps.gz.
  • [87] Andreas van Cranenburgh (2012a), Efficient parsing with linear context-free rewriting systems, in Proceedings of EACL, pp. 460-470, corrected version: http://andreasvc.github.io/eacl2012corrected.pdf.
  • [88] Andreas van Cranenburgh (2012b), Literary authorship attribution with phrase-structure fragments, in Proceedings of CLFL, pp. 59-63, revised version: http://andreasvc.github.io/clfl2012.pdf.
  • [89] Andreas van Cranenburgh (2014), Extraction of phrase-structure fragments with a linear average time tree kernel, Computational Linguistics in the Netherlands Journal, 4: 3-16, ISSN 2211-4009, http://www.clinjournal.org/sites/default/files/01-Cranenburgh-CLIN2014.pdf.
  • [90] Andreas van Cranenburgh and Rens Bod (2013), Discontinuous parsing with an efficient and accurate DOP model, in Proceedings of IWPT, pp. 7-16, http://www.illc.uva.nl/LaCo/CLS/papers/iwpt2013parser_final.pdf.
  • [91] Andreas van Cranenburgh, Remko Scha, and Federico Sangati (2011), Discontinuous data-oriented parsing: A mildly context-sensitive all-fragments grammar, in Proceedings of SPMRL, pp. 34-44, http://aclweb.org/anthology/W11-3805.
  • [92] Leonoor van der Beek, Gosse Bouma, Robert Malouf, and Gertjan van Noord (2002), The Alpino dependency treebank, Language and Computers, 45 (1): 8-22.
  • [93] van der Wouden, Heleen Hoekstra, Michael Moortgat, Bram Renmans, and Ineke Schuurman (2002), Syntactic analysis in the spoken Dutch corpus (CGN), in Proceedings of LREC, pp. 768-773, http://www.lrec-conf.org/proceedings/lrec2002/pdf/71.pdf.
  • [94] Gertjan Van Noord (2009), Huge parsed corpora in Lassy, in Proceedings of TLT7, LOT, Groningen, The Netherlands.
  • [95] Yannick Versley (2014), Experiments with easy-first nonprojective constituent parsing, in Proceedings of SPMRL-SANCL 2014, pp. 39-53, http://aclweb.org/anthology/W14-6104.
  • [96] K. Vijay-Shanker and David J. Weir (1994), The equivalence of four extensions of context-free grammars, Theory of Computing Systems, 27 (6): 511-546.
  • [97] K. Vijay-Shanker, David J. Weir, and Aravind K. Joshi (1987), Characterizing structural descriptions produced by various grammatical formalisms, in Proceedings of ACL, pp. 104-111, http://aclweb.org/anthology/P87-1015.
  • [98] David J. Weir (1988), Characterizing mildly context-sensitive grammar formalisms, Ph.D. thesis, University of Pennsylvania, http://repository.upenn.edu/dissertations/AAI8908403/.
  • [99] Rulon S. Wells (1947), Immediate constituents, Language, 23 (2): 81-117, http://www.jstor.org/stable/410382.
  • [100] Fei Xia, Chung-Hye Han, Martha Palmer, and Aravind Joshi (2001), Automatically extracting and comparing lexicalized grammars for different languages, in Proceedings of IJCAI, pp. 1321-1330.
  • [101] Elif Yamangil and Stuart Shieber (2012), Estimating compact yet rich tree insertion grammars, in Proceedings of ACL, pp. 110-114, http://aclweb.org/anthology/P12-2022.
  • [102] Andreas Zollmann and Khalil Sima’an (2005), A consistent and efficient estimator for data-oriented parsing, Journal of Automata Languages and Combinatorics, 10 (2/3): 367-388, http://staff.science.uva.nl/~simaan/D-Papers/JALCsubmit.pdf.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-589cc3d9-a0e5-4d92-8cd2-a650ec0d65c7