Article title

A French corpus annotated for multiword expressions and named entities

Publication languages
EN
Abstracts
EN
We present the enrichment of a French treebank of various genres with a new annotation layer for multiword expressions (MWEs) and named entities (NEs). Our contribution with respect to previous work on NE and MWE annotation is the particular care taken to use formal criteria, organized into decision flowcharts, shedding some light on the interactions between NEs and MWEs. Moreover, to cope with the well-known difficulty of drawing a clear-cut boundary between compositional expressions and MWEs, we chose to use sufficient criteria only. As a result, annotated MWEs satisfy a varying number of sufficient criteria, accounting for the scalar nature of the MWE status. In addition to the span of the elements, annotation includes the subcategory of NEs (e.g., person, location) and one matching sufficient criterion for non-verbal MWEs (e.g., lexical substitution). The 3,099 sentences of the treebank were double-annotated and adjudicated, and we paid attention to cross-type consistency and compatibility with the syntactic layer. Overall inter-annotator agreement on non-verbal MWEs and NEs reached 71.1%. The released corpus contains 3,112 annotated NEs and 3,440 MWEs, and is distributed under an open license.
Pages
415-479
Physical description
Bibliography: 61 items, figures, tables.
Authors
  • Université de Paris, CNRS, LLF, Paris, France
  • Université de Lorraine, CNRS, ATILF, Nancy, France
  • Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France
  • Université de Tours, LIFAT, Tours, France
  • Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
  • Université de Lorraine, CNRS, LORIA, Nancy, France
  • Université d’Orléans, LIFO, Orléans, France
  • Université de Paris, CNRS, LLF, Paris, France
Bibliography
  • [1]. Anne ABEILLÉ and Lionel CLÉMENT (1999-2015), Corpus le Monde, annotation morpho-syntaxique : Les mots simples - les mots composés, http://ftb.linguist.univ-paris-diderot.fr/fichiers/public/guide-morphosynt.pdf.
  • [2]. Anne ABEILLÉ, Lionel CLÉMENT, and Loïc LIÉGEOIS (2019), Un corpus arboré pour le français : le French Treebank, Traitement Automatique des Langues, 60(2):19-43.
  • [3]. Anne ABEILLÉ, Lionel CLÉMENT, and François TOUSSENEL (2003), Building a treebank for French, in Anne ABEILLÉ, editor, Treebanks: Building and using parsed corpora, pp. 165-187, Kluwer Academic Publishers, Dordrecht, The Netherlands.
  • [4]. Timothy BALDWIN and Su Nam KIM (2010), Multiword expressions, in Nitin INDURKHYA and Fred J. DAMERAU, editors, Handbook of natural language processing, second edition, pp. 267-292, CRC Press, Boca Raton.
  • [5]. Eduard BEJČEK and Pavel STRAŇÁK (2010), Annotation of multiword expressions in the Prague Dependency Treebank, Language Resources and Evaluation, 44(1-2):7-21.
  • [6]. Eduard BEJČEK, Pavel STRAŇÁK, and Daniel ZEMAN (2011), Influence of treebank design on representation of multiword expressions, in Alexander F. GELBUKH, editor, Proceedings of CICLing 2011 (volume 1), pp. 1-14, Tokyo, Japan.
  • [7]. Conor CAFFERKEY, Deirdre HOGAN, and Josef VAN GENABITH (2007), Multi-word units in treebank-based probabilistic parsing and generation, in Proceedings of RANLP 2007, pp. 98-103, Borovets, Bulgaria.
  • [8]. Nicoletta CALZOLARI, Charles J. FILLMORE, Ralph GRISHMAN, Nancy IDE, Alessandro LENCI, Catherine MACLEOD, and Antonio ZAMPOLLI (2002), Towards best practice for multiword expressions in computational lexicons, in Proceedings of LREC 2002, pp. 1934-1940, Las Palmas, Spain.
  • [9]. Marie CANDITO, Mathieu CONSTANT, Carlos RAMISCH, Agata SAVARY, Yannick PARMENTIER, Caroline PASQUER, and Jean-Yves ANTOINE (2017), Annotation d’expressions polylexicales verbales en français, in Proceedings of TALN 2017, pp. 1-9, Orléans, France.
  • [10]. Marie CANDITO and Matthieu CONSTANT (2014), Strategies for contiguous multiword expression analysis and dependency parsing, in Proceedings of ACL 2014 (volume 1: long papers), pp. 743-753, Baltimore, USA.
  • [11]. Marie CANDITO and Djamé SEDDAH (2012), Le corpus Sequoia : Annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical, in Proceedings of JEP-TALN-RECITAL 2012, pp. 321-344, Grenoble, France.
  • [12]. Dolors CATALÀ and Jorge BAPTISTA (2007), Spanish adverbial frozen expressions, in Proceedings of MWE 2007, pp. 33-40, Prague, Czech Republic.
  • [13]. Nancy A. CHINCHOR (1997), Appendix E: MUC-7 named entity task definition, in Proceedings of MUC-7, Fairfax, USA.
  • [14]. Nancy A. CHINCHOR (1998), Overview of MUC-7, in Proceedings of MUC-7, Fairfax, USA.
  • [15]. Jacob COHEN (1960), A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20:37-46.
  • [16]. Mathieu CONSTANT, Gülşen ERYIĞIT, Johanna MONTI, Lonneke VAN DER PLAS, Carlos RAMISCH, Michael ROSNER, and Amalia TODIRASCU (2017), Multiword expression processing: A survey, Computational Linguistics, 43(4):837-892.
  • [17]. Ann COPESTAKE, Fabre LAMBEAU, Aline VILLAVICENCIO, Francis BOND, Timothy BALDWIN, Ivan A. SAG, and Dan FLICKINGER (2002), Multiword expressions: linguistic precision and reusability, in Proceedings of LREC 2002, pp. 1941-1947, Las Palmas, Spain.
  • [18]. Laurence DANLOS (1980), Représentations d’informations linguistiques : constructions N être Prép X, Ph.D. thesis, Université Paris 7, France.
  • [19]. Maud EHRMANN (2008), Les Entités Nommées, de la linguistique au TAL : Statut théorique et méthodes de désambiguïsation, Ph.D. thesis, Université Paris Diderot, France.
  • [20]. Karën FORT and Benoît SAGOT (2010), Influence of pre-annotation on POS-tagged corpus development, in Proceedings of LAW 2010, pp. 56-63, Uppsala, Sweden.
  • [21]. Peter FRECKLETON (1985), Sentence idioms in English, Working Papers in Linguistics, 11:153-168.
  • [22]. Guillaume GRAVIER, Gilles ADDA, Niklas PAULSSON, Matthieu CARRÉ, Aude GIRAUDEL, and Olivier GALIBERT (2012), The ETAPE corpus for the evaluation of speech-based TV content processing in the French language, in Proceedings of LREC 2012, pp. 114-118, Istanbul, Turkey.
  • [23]. Gaston GROSS (1988), Degré de figement des noms composés, Langages, 90:57-72.
  • [24]. Maurice GROSS (1986), Lexicon-grammar: the representation of compound words, in Proceedings of COLING 1986, pp. 1-6, Bonn, Germany.
  • [25]. Maurice GROSS (1994), The lexicon-grammar of a language: application to French, in R. E. ASHER, editor, The encyclopedia of language and linguistics, pp. 2195-2205, Pergamon Press, Oxford, UK.
  • [26]. Cyril GROUIN, Sophie ROSSET, Pierre ZWEIGENBAUM, Karën FORT, Olivier GALIBERT, and Ludovic QUINTARD (2011), Proposal for an extension of traditional named entities: from guidelines to evaluation, an overview, in Proceedings of LAW 2011, pp. 92-100, Portland, USA.
  • [27]. Jan HAJIČ, Eva HAJIČOVÁ, Marie MIKULOVÁ, and Jiří MÍROVSKÝ (2017), Prague Dependency Treebank, Handbook on linguistic annotation, pp. 555-594, Springer Handbooks, Springer Verlag, ISBN 978-94-024-0879-9.
  • [28]. Georges KLEIBER (1996), Noms propres et noms communs : un problème de dénomination, Meta, 41(4):567-589.
  • [29]. Georges KLEIBER (2001), Remarques sur la dénomination, Cahiers de Praxématique, 36:21-41.
  • [30]. Georges KLEIBER (2007), Sur le rôle cognitif des noms propres, Cahiers de Lexicologie, 91(2):153-167.
  • [31]. Éric LAPORTE, Takuya NAKAMURA, and Stavroula VOYATZI (2008a), A French corpus annotated for multiword nouns, in Proceedings of MWE 2008, pp. 27-30, Marrakech, Morocco.
  • [32]. Éric LAPORTE (2018), Choosing features for classifying multiword expressions, in Manfred SAILER and Stella MARKANTONATOU, editors, Multiword expressions: insights from a multi-lingual perspective, pp. 143-186, Language Science Press, Berlin, Germany.
  • [33]. Éric LAPORTE, Takuya NAKAMURA, and Stavroula VOYATZI (2008b), A French corpus annotated for Multiword Expressions with adverbial function, in Proceedings of LAW 2008, pp. 48-51, Marrakech, Morocco.
  • [34]. Veronika LUX-POGODALLA and Alain POLGUÈRE (2011), Construction of a French lexical network: methodological issues, in Proceedings of WoLeR 2011, pp. 54-61, Ljubljana, Slovenia.
  • [35]. Katja MARKERT and Malvina NISSIM (2007), SemEval-2007 task 08: metonymy resolution at SemEval-2007, in Proceedings of SemEval 2007, pp. 36-41, Prague, Czech Republic.
  • [36]. Yann MATHET, Antoine WIDLÖCHER, and Jean-Philippe MÉTIVIER (2015), The unified and holistic method Gamma (γ) for inter-annotator agreement measure and alignment, Computational Linguistics, 41(3):437-479.
  • [37]. Igor MEL’ČUK (2010), La phraséologie en langue, en dictionnaire et en TALN, in Proceedings of TALN 2010 (invited talks), Montréal, Canada.
  • [38]. Igor MEL’ČUK (2012), Phraseology in the language, in the dictionary, and in the computer, Yearbook of Phraseology, 3:31-56.
  • [39]. Marie MIKULOVÁ, Alevtina BÉMOVÁ, Jan HAJIČ, Eva HAJIČOVÁ, Jiří HAVELKA, Veronika KOLÁŘOVÁ, Lucie KUČOVÁ, Markéta LOPATKOVÁ, Petr PAJAS, Jarmila PANEVOVÁ, Magda RAZÍMOVÁ, Petr SGALL, Jan ŠTĚPÁNEK, Zdeňka UREŠOVÁ, Kateřina VESELÁ, and Zdeněk ŽABOKRTSKÝ (2006), Annotation on the tectogrammatical level in the Prague Dependency Treebank. Annotation manual, Technical report 30, ÚFAL MFF UK, Prague, Czech Republic.
  • [40]. Joakim NIVRE, Marie-Catherine DE MARNEFFE, Filip GINTER, Yoav GOLDBERG, Jan HAJIČ, Christopher D. MANNING, Ryan MCDONALD, Slav PETROV, Sampo PYYSALO, Natalia SILVEIRA, Reut TSARFATY, and Daniel ZEMAN (2016), Universal Dependencies v1: a multilingual treebank collection, in Proceedings of LREC 2016, pp. 1659-1666, Portorož, Slovenia.
  • [41]. Aurélie NÉVÉOL, Cyril GROUIN, Jeremy LEIXA, Sophie ROSSET, and Pierre ZWEIGENBAUM (2014), The QUAERO French medical corpus: a resource for medical entity recognition and normalization, in Proceedings of BioTxtM 2014, pp. 24-30, Reykjavik, Iceland.
  • [42]. Marie-Sophie PAUSÉ (2017), Structure lexico-syntaxique des locutions du français et incidence sur leur combinatoire, Ph.D. thesis, Université de Lorraine, Nancy, France.
  • [43]. Alain POLGUÈRE (2014), Principes de modélisation systémique des réseaux lexicaux, in Proceedings of TALN 2014 (volume 1: long papers), pp. 79-90, Marseille, France.
  • [44]. Carlos RAMISCH, Silvio Ricardo CORDEIRO, Agata SAVARY, Veronika VINCZE, Verginica BARBU MITITELU, Archna BHATIA, Maja BULJAN, Marie CANDITO, Polona GANTAR, Voula GIOULI, Tunga GÜNGÖR, Abdelati HAWWARI, Uxoa IÑURRIETA, Jolanta KOVALEVSKAITĖ, Simon KREK, Timm LICHTE, Chaya LIEBESKIND, Johanna MONTI, Carla PARRA ESCARTÍN, Behrang QASEMIZADEH, Renata RAMISCH, Nathan SCHNEIDER, Ivelina STOYANOVA, Ashwini VAIDYA, and Abigail WALSH (2018), Edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions, in Proceedings of LAW-MWE-CxG-2018, pp. 222-240, Santa Fe, USA.
  • [45]. Carlos RAMISCH, Alexis NASR, André VALLI, and José DEULOFEU (2016), DeQue: a lexicon of complex prepositions and conjunctions in French, in Proceedings of LREC 2016, pp. 2293-2298, Portorož, Slovenia.
  • [46]. Victoria ROSÉN, Gyri Smørdal LOSNEGAARD, Koenraad DE SMEDT, Eduard BEJČEK, Agata SAVARY, Adam PRZEPIÓRKOWSKI, Petya OSENOVA, and Verginica BARBU MITITELU (2015), A survey of multiword expressions in treebanks, in Proceedings of TLT 2015, pp. 179-193, Warsaw, Poland.
  • [47]. Ivan A. SAG, Timothy BALDWIN, Francis BOND, Ann A. COPESTAKE, and Dan FLICKINGER (2002), Multiword expressions: a pain in the neck for NLP, in Proceedings of CICLing 2002, pp. 1-15, Springer-Verlag, ISBN 3-540-43219-1.
  • [48]. Benoît SAGOT, Marion RICHARD, and Rosa STERN (2012), Annotation référentielle du corpus arboré de Paris 7 en entités nommées, in Proceedings of JEP-TALN-RECITAL 2012 (volume 2), pp. 535-542, Grenoble, France.
  • [49]. Benoît SAGOT and Rosa STERN (2012), Aleda, a free large-scale entity database for French, in Proceedings of LREC 2012, pp. 1273-1276, Istanbul, Turkey.
  • [50]. Agata SAVARY, Marie CANDITO, Verginica BARBU MITITELU, Eduard BEJČEK, Fabienne CAP, Slavomír ČÉPLÖ, Silvio Ricardo CORDEIRO, Gülşen ERYIĞIT, Voula GIOULI, Maarten VAN GOMPEL, Yaakov HACOHEN-KERNER, Jolanta KOVALEVSKAITĖ, Simon KREK, Chaya LIEBESKIND, Johanna MONTI, Carla PARRA ESCARTÍN, Lonneke VAN DER PLAS, Behrang QASEMIZADEH, Carlos RAMISCH, Federico SANGATI, Ivelina STOYANOVA, and Veronika VINCZE (2018), PARSEME multilingual corpus of verbal multiword expressions, in Stella MARKANTONATOU, Carlos RAMISCH, Agata SAVARY, and Veronika VINCZE, editors, Multiword expressions at length and in depth: extended papers from the MWE 2017 workshop, pp. 87-147, Language Science Press, Berlin, Germany.
  • [51]. Agata SAVARY, Carlos RAMISCH, Silvio CORDEIRO, Federico SANGATI, Veronika VINCZE, Behrang QASEMIZADEH, Marie CANDITO, Fabienne CAP, Voula GIOULI, Ivelina STOYANOVA, and Antoine DOUCET (2017), The PARSEME shared task on automatic identification of verbal multiword expressions, in Proceedings of MWE 2017, pp. 31-47, Valencia, Spain.
  • [52]. Agata SAVARY, Jakub WASZCZUK, and Adam PRZEPIÓRKOWSKI (2010), Towards the annotation of named entities in the National Corpus of Polish, in Proceedings of LREC 2010, pp. 3622-3629, Valletta, Malta.
  • [53]. Nathan SCHNEIDER, Spencer ONUFFER, Nora KAZOUR, Emily DANCHIK, Michael T. MORDOWANEC, Henrietta CONRAD, and Noah A. SMITH (2014), Comprehensive annotation of multiword expressions in a social web corpus, in Proceedings of LREC 2014, pp. 455-461, Reykjavik, Iceland.
  • [54]. Djamé SEDDAH, Reut TSARFATY, Sandra KÜBLER, Marie CANDITO, Jinho D. CHOI, Richárd FARKAS, Jennifer FOSTER, Iakes GOENAGA, Koldo Gojenola GALLETEBEITIA, Yoav GOLDBERG, Spence GREEN, Nizar HABASH, Marco KUHLMANN, Wolfgang MAIER, Joakim NIVRE, Adam PRZEPIÓRKOWSKI, Ryan ROTH, Wolfgang SEEKER, Yannick VERSLEY, Veronika VINCZE, Marcin WOLIŃSKI, Alina WRÓBLEWSKA, and Eric VILLEMONTE DE LA CLERGERIE (2013), Overview of the SPMRL 2013 shared task: a cross-framework evaluation of parsing morphologically rich languages, in Proceedings of SPMRL 2013, pp. 146-182, Seattle, USA.
  • [55]. Livnat Herzig SHEINFUX, Tali Arad GRESHLER, Nurit MELNIK, and Shuly WINTNER (2019), Verbal multiword expressions: idiomaticity and flexibility, in Yannick PARMENTIER and Jakub WASZCZUK, editors, Representation and parsing of multiword expressions: current trends, pp. 35-68, Language Science Press, Berlin, Germany.
  • [56]. Erik F. TJONG KIM SANG (2002), Introduction to the CoNLL-2002 shared task: language-independent named entity recognition, in Proceedings of CoNLL 2002, volume 20, pp. 1-4, Taipei, Taiwan.
  • [57]. Erik F. TJONG KIM SANG and Fien DE MEULDER (2003), Introduction to the CoNLL-2003 shared task: language-independent named entity recognition, in Proceedings of CoNLL 2003, pp. 142-147, Edmonton, Canada.
  • [58]. Agnès TUTIN and Emmanuelle ESPERANÇA-RODIER (2019), The difficult identification of multiword expressions: from decision criteria to annotated corpora, in Computational and corpus-based phraseology, pp. 404-416, Springer-Verlag, ISBN 978-3-030-30135-4.
  • [59]. Agnès TUTIN, Emmanuelle ESPERANÇA-RODIER, Manolo IBORRA, and Justine REVERDY (2016), Annotation of multiword expressions in French, in Proceedings of EUROPHRAS 2015, pp. 60-67, Malaga, Spain.
  • [60]. Maarten VAN GOMPEL and Martin REYNAERT (2013), FoLiA: a practical XML format for linguistic annotation - a descriptive and comparative study, Computational Linguistics in the Netherlands Journal, 3:63-81.
  • [61]. Veronika VINCZE, István NAGY T., and Gábor BEREND (2011), Multiword expressions and named entities in the Wiki50 corpus, in Proceedings of RANLP 2011, pp. 289-295, Hissar, Bulgaria.
Notes
Record prepared with funding from MNiSW, agreement no. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of Science and Promotion of Sport (2021).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-b4bfccba-ed5b-44b2-bd0c-3784070203b2