Populating a multilingual ontology of proper names from open sources

Savary, A.; Manicki, L.; Baron, M.

doi:10.15398/jlm.v1i2.63


Abstract

Although proper names play a central role in natural language processing (NLP) applications, they are still under-represented in lexicons, annotated corpora, and other resources dedicated to text processing. One of the main challenges is that proper names are both prevalent and dynamic. At the same time, large and regularly updated knowledge sources containing partially structured data, such as Wikipedia or GeoNames, are publicly available and contain large numbers of proper names. We present a method for the semi-automatic enrichment of Prolexbase, an existing multilingual ontology of proper names dedicated to natural language processing, with data extracted from these open sources in three languages: Polish, English and French. Fine-grained data extraction and integration procedures allow the user to enrich the previous contents of Prolexbase with new incoming data. All data are manually validated and available under an open licence.

Keywords

proper names; named entities; multilingual ontology population; Prolexbase; Wikipedia; GeoNames; Translatica

Publisher

Instytut Podstaw Informatyki PAN

Journal

Journal of Language Modelling

Year

2013

Volume

Vol. 1, No. 2

Pages

189-225

Physical description

Bibliography: 32 items, figures, tables.

Contributors

author

Savary A.

Université François Rabelais Tours, Laboratoire d’informatique, France

author

Manicki L.

Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Poleng, Poznań, Poland

author

Baron M.

Université François Rabelais Tours, Laboratoire d’informatique, France
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland

Bibliography

[1] Claire Agafonov, Thierry Grass, Denis Maurel, Nathalie Rossi-Gensane, and Agata Savary (2006), La traduction multilingue des noms propres dans PROLEX, Meta, 51 (4): 622-636, Les Presses de l’Université de Montréal.
[2] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann (2009), DBpedia – A crystallization point for the Web of Data, J. Web Sem., 7 (3): 154-165.
[3] Kurt Bollacker, Patrick Tufts, Tomi Pierce, and Robert Cook (2007), A Platform for Scalable, Collaborative, Structured Information Integration, in Proceedings of the Sixth International Workshop on Information Integration on the Web.
[4] Béatrice Bouchou and Denis Maurel (2008), Prolexbase et LMF : vers un standard pour les ressources lexicales sur les noms propres, TAL, 49 (1): 61-88.
[5] Samuel Fernando and Mark Stevenson (2012), Mapping WordNet synsets to Wikipedia articles, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey.
[6] Sergio Ferrández, Antonio Toral, Óscar Ferrández, Antonio Ferrández, and Rafael Muñoz (2007), Applying Wikipedia’s Multilingual Knowledge to Cross-Lingual Question Answering, in Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems, NLDB 2007, volume 4592 of Lecture Notes in Computer Science, p. 352-363, Springer.
[7] Filip Graliński, Krzysztof Jassem, and Michał Marcińczuk (2009), An Environment for Named Entity Recognition and Translation, in Proceedings of the 13th Annual Meeting of the European Association for Machine Translation (EAMT’09), p. 88-96, Barcelona.
[8] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Edwin Lewis-Kelham, Gerard de Melo, and Gerhard Weikum (2011), YAGO2: exploring and querying world knowledge in time, space, context, and many languages, in Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 - April 1, 2011 (Companion Volume), p. 229-232, ACM.
[9] Krzysztof Jassem (2004), Applying Oxford-PWN English-Polish dictionary to Machine Translation, in Proceedings of 9th European Association for Machine Translation Workshop, “Broadening horizons of machine translation and its applications”, Malta, April, p. 98-105.
[10] Valentin Jijkoun, Mahboob Alam Khalid, Maarten Marx, and Maarten de Rijke (2008), Named entity normalization in user generated content, in Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, AND 2008, ACM International Conference Proceeding Series, p. 23-30, ACM.
[11] Cvetana Krstev, Duško Vitas, Denis Maurel, and Mickaël Tran (2005), Multilingual Ontology of Proper Names, in Proceedings of Language and Technology Conference (LTC’05), Poznań, Poland, p. 116-119, Wydawnictwo Poznańskie.
[12] Giridhar Kumaran and James Allan (2004), Text classification and named entities for new event detection, in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR’04, p. 297-304.
[13] Denis Maurel (2008), Prolexbase. A multilingual relational lexical database of proper names, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, p. 334-338.
[14] Denis Maurel, Nathalie Friburger, Jean-Yves Antoine, Iris Eshkol-Taravella, and Damien Nouvel (2011), Cascades de transducteurs autour de la reconnaissance des entités nommées, Traitement Automatique des Langues, 52 (1): 69-96.
[15] Gerard de Melo and Gerhard Weikum (2009), Towards a universal wordnet by learning from combined evidence, in Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009, p. 513-522, ACM.
[16] Gerard de Melo and Gerhard Weikum (2010), MENTA: inducing multilingual taxonomies from Wikipedia, in Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26-30, 2010, p. 1099-1108, ACM.
[17] Pablo Mendes, Max Jakob, and Christian Bizer (2012), DBpedia: A Multilingual Cross-domain Knowledge Base, in Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey.
[18] George A. Miller (1995), WordNet: A Lexical Database for English, Commun. ACM, 38 (11): 39-41.
[19] Hien Thang Nguyen and Tru Hoang Cao (2010), Enriching Ontologies for Named Entity Disambiguation, in Proceedings of the 4th International Conference on Advances in Semantic Processing (SEMAPRO 2010), Florence, Italy.
[20] Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, Anastasia Krithara, and Elias Zavitsanos (2011), Ontology Population and Enrichment: State of the Art, in Knowledge-Driven Multimedia Information Extraction and Ontology Evolution, volume 6050 of Lecture Notes in Computer Science, p. 134-166, Springer.
[21] Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, editors (2012), Narodowy Korpus Języka Polskiego [Eng.: National Corpus of Polish], Wydawnictwo Naukowe PWN, Warsaw.
[22] Alexander E. Richman and Patrick Schone (2008), Mining Wiki Resources for Multilingual Named Entity Recognition, in Kathleen McKeown, Johanna D. Moore, Simone Teufel, James Allan, and Sadaoki Furui, editors, ACL, p. 1-9, The Association for Computer Linguistics, ISBN 978-1-932432-04-6.
[23] K. Saravanan, Monojit Choudhury, Raghavendra Udupa, and A. Kumaran (2012), An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora, in Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey.
[24] Agata Savary, Leszek Manicki, and Małgorzata Baron (2013), ProlexFeeder – Populating a Multilingual Ontology of Proper Names from Open Sources, Technical Report 306, Laboratoire d’informatique, François Rabelais University of Tours, France.
[25] Pavel Shvaiko and Jérôme Euzenat (2013), Ontology Matching: State of the Art and Future Challenges, IEEE Trans. Knowl. Data Eng., 25 (1): 158-176.
[26] Ralf Steinberger, Bruno Pouliquen, Mijail Alexandrov Kabadjov, Jenya Belyaeva, and Erik Van der Goot (2011), JRC-NAMES: A Freely Available, Highly Multilingual Named Entity Resource, in Recent Advances in Natural Language Processing, RANLP 2011, 12-14 September, 2011, Hissar, Bulgaria, p. 104-110.
[27] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum (2007), YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, in WWW’07: Proceedings of the 16th International World Wide Web Conference, p. 697-706, Banff, Canada.
[28] Antonio Toral, Sergio Ferrández, Monica Monachini, and Rafael Muñoz (2012), Web 2.0, Language Resources and standards to automatically build a multilingual Named Entity Lexicon, Language Resources and Evaluation, 46 (3): 383-419.
[29] Antonio Toral, Rafael Muñoz, and Monica Monachini (2008), Named Entity WordNet, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2008), European Language Resources Association, Marrakech, Morocco.
[30] Mickaël Tran and Denis Maurel (2006), Prolexbase: Un dictionnaire relationnel multilingue de noms propres, Traitement Automatique des Langues, 47 (3): 115-139.
[31] Mickaël Tran, Denis Maurel, Duško Vitas, and Cvetana Krstev (2005), A French-Serbian Web Collaborative Work on a Multilingual Dictionary of Proper Names, in Proceedings of the 6th Workshop on Multilingual Lexical Databases (PAPILLON’05), Chiang Rai, Thailand.
[32] Piek Vossen (1998), Introduction to EuroWordNet, Computers and the Humanities, 32 (2-3): 73-89.

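Notes

Record prepared with funding from MNiSW (the Polish Ministry of Science and Higher Education), agreement No. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of science and promotion of sport (2020).

Document type

Bibliography

YADDA identifier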

bwmeta1.element.baztech-2cdfeec2-0a76-4ad2-9a5e-b3dc7311b44e