Paintball : Automated Wordnet Expansion Algorithm based on Distributional Semantics and Information Spreading

Piasecki, Maciej

doi:10.12921/cmst.2018.0000051

Selected full texts from this journal

http://cmst.eu/

Abstract

plWordNet has been built consistently according to the corpus-based wordnet development method. Because the construction of plWordNet started from scratch, it was necessary to find a way to reduce the amount of work required without reducing quality. In this paper we discuss the experience gained in applying different tools based on Distributional Semantics methods to support the work of lexicographers. Special attention is given to the Paintball algorithm for semi-automated wordnet expansion and its application in the WordnetWeaver system.

Keywords

wordnet; lexical semantic network; automated wordnet expansion; natural language engineering; linguistic knowledge extraction; WordNet; plWordNet

Publisher

Institute of Bioorganic Chemistry Scientific Publishers OWN, Polish Academy of Sciences

Journal

Computational Methods in Science and Technology

Year

2019

Volume

Vol. 25, No. 1

Pages

41–56

Physical description

Bibliography: 53 items, tables.

Contributors

author

Piasecki Maciej

maciej.piasecki@pwr.wroc.pl

Faculty of Computer Science and Management, G4.19 Research Group, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, 50-370 Wrocław, Poland

Bibliography

[1] M. Maziarz, M. Piasecki, E. Rudnicka, S. Szpakowicz, P. Kędzia, plWordNet 3.0 – a comprehensive lexical-semantic resource, [in:] COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11–16, 2016, Osaka, Japan (N. Calzolari, Y. Matsumoto, R. Prasad, eds.), pp. 2259–2268, ACL, 2016.
[2] P. Vossen, ed., EuroWordNet. A multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, 1998.
[3] P. Vossen, EuroWordNet General Document Version 3, tech. rep., Univ. of Amsterdam, 2002.
[4] M. Derwojedowa, M. Piasecki, S. Szpakowicz, M. Zawisławska, B. Broda, Words, Concepts and Relations in the Construction of Polish WordNet, [in:] Proc. Fourth Global WordNet Conf. (A. Tanács, D. Csendes, V. Vincze, C. Fellbaum, P. Vossen, eds.), pp. 162–177, 2008.
[5] M. Maziarz, M. Piasecki, E. Rudnicka, S. Szpakowicz, Beyond the transfer-and-merge wordnet construction: plWordNet and a comparison with WordNet, [in:] Proc. International Conference Recent Advances in Natural Language Processing RANLP 2013, pp. 443–452, INCOMA Ltd. Shoumen, BULGARIA, 2013.
[6] D. Widdows, Geometry and Meaning. CSLI Publications, 2004.
[7] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, CoRR, vol. abs/1310.4546, 2013.
[8] M. Piasecki, S. Szpakowicz, B. Broda, A Wordnet from the Ground Up. Wrocław: Oficyna Wydawnicza Politechniki Wrocławskiej, 2009.
[9] A. Przepiórkowski, The IPI PAN Corpus: Preliminary version. Institute of Computer Science, Polish Academy of Sciences, 2004.
[10] A. Przepiórkowski, M. Bańko, R.L. Górski, B. Lewandowska-Tomaszczyk, eds., Narodowy Korpus Języka Polskiego [in Polish]. Wydawnictwo Naukowe PWN, 2012. http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf.
[11] D. Weiss, Korpus Rzeczpospolitej [Corpus of text from the online edition of “Rzeczpospolita”], http://www.cs.put.poznan.pl/dweiss/rzeczpospolita, 2008.
[12] M. Woliński, Morfeusz reloaded, [in:] Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, pp. 1106–1111, ELRA, 2014.
[13] B. Svensén, A Handbook of Lexicography. The Theory and Practice of Dictionary-Making. Cambridge University Press, 2009.
[14] C. Fellbaum, A Semantic Network of English: The Mother of All WordNets, Computers and the Humanities, vol. 32, pp. 209–220, 1998.
[15] B. Broda, M. Maziarz, M. Piasecki, Tools for plWordNet Development. Presentation and Perspectives, [in:] Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pp. 3647–3652, European Language Resources Association (ELRA), May 2012.
[16] M. Piasecki, M. Marcińczuk, R. Ramocki, M. Maziarz, WordNetLoom: a WordNet development system integrating form-based and graph-based perspectives, International Journal of Data Mining, Modelling and Management, vol. 5, no. 3, pp. 210–232, 2013.
[17] T. Naskręt, A. Dziob, M. Piasecki, C. Saedi, A. Branco, WordnetLoom – a multilingual wordnet editing system focused on graph-based presentation, [in:] Proceedings of the 9th Global Wordnet Conference, Singapore, 8–12 January 2018 (F. Bond, C. Fellbaum, P. Vossen, eds.), Global Wordnet Association, 2018.
[18] M. Wynne, ed., Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books, 2005.
[19] plWordNet, Frequency List from plWordNet Corpus, 2012. www.nlp.pwr.wroc.pl/pl/narzedzia-i-zasoby/lista-frekwencyjna.
[20] B. Broda and M. Piasecki, Parallel, massive processing in SuperMatrix – a general tool for distributional semantic analysis of corpora, International Journal of Data Mining, Modelling and Management, vol. 5, no. 1, pp. 1–19, 2013.
[21] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, CoRR, vol. abs/1301.3781, 2013.
[22] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606, 2016.
[23] M. Piasecki, G. Czachor, A. Janz, D. Kaszewski, P. Kędzia, Wordnet-based evaluation of large distributional models for Polish, [in:] Proceedings of the 9th Global Wordnet Conference, Singapore, 8–12 January 2018 (F. Bond, C. Fellbaum, P. Vossen, eds.), Global WordNet Association, 2018.
[24] M. Piasecki, A. Janz, D. Kaszewski, G. Czachor, Word embeddings for Polish, 2017. CLARIN-PL digital repository.
[25] J. Kocoń, KGR10 FastText Polish word embeddings, 2018. CLARIN-PL digital repository.
[26] J. Kocoń and M. Marcińczuk, Word embeddings for Polish (KGR10, fasttext binary) kgr10_fasttext_bin_v1, 2018. CLARIN-PL digital repository.
[27] G. Karypis, CLUTO a clustering toolkit, Technical Report 02-017, Department of Computer Science, University of Minnesota, 2002.
[28] B. Broda, M. Maziarz, M. Piasecki, Evaluating LexCSD – a Weakly-Supervised Method on Improved Semantically Annotated Corpus in a Large Scale Experiment, [in:] Proceedings of a Conference on Intelligent Information Systems (M.A. Kłopotek, A. Przepiórkowski and K. Trojanowski, eds.), 2010.
[29] D. Janus and A. Przepiórkowski, Poliqarp 1.0: Some technical aspects of a linguistic search engine for large corpora, [in:] The proceedings of Practical Applications of Linguistic Corpora, 2005.
[30] T. Machalek, KonText – a modern, customizable corpus query interface, [in:] Book of Abstracts of the Corpus Linguistics 2017 Conference, 25–28 July 2017, Birmingham, University of Birmingham, 2017.
[31] M. Piasecki and M. Wendelberger, Partial measure of semantic relatedness based on the local feature selection, [in:] Text, Speech and Dialogue – 17th International Conference, TSD 2014, Brno, Czech Republic, September 8–12, 2014. Proceedings (P. Sojka, A. Horák, I. Kopecek, K. Pala, eds.), vol. 8655 of Lecture Notes in Computer Science, pp. 336–343, Springer, 2014.
[32] M. Piasecki, R. Ramocki, M. Kaliński, Information spreading in expanding wordnet hypernymy structure, [in:] Proc. International Conference Recent Advances in Natural Language Processing RANLP 2013, pp. 553–561, INCOMA Ltd. Shoumen, BULGARIA, 2013.
[33] M. Piasecki, Ł. Burdka, M. Maziarz, M. Kaliński, Diagnostic tools in plWordNet development process, [in:] Human Language Technology. Challenges for Computer Science and Linguistics (Z. Vetulani, H. Uszkoreit, and M. Kubis, eds.), vol. 9561 of LNCS, pp. 255–273, Springer, 2016.
[34] R. Snow, D. Jurafsky, A.Y. Ng, Semantic taxonomy induction from heterogenous evidence, pp. 801–808, The Association for Computational Linguistics, 2006.
[35] A.M. Collins and E.F. Loftus, A spreading-activation theory of semantic processing, Psychological Review, vol. 82, no. 6, pp. 407–428, 1975.
[36] G. Salton and C. Buckley, On the use of spreading activation methods in automatic Information Retrieval, [in:] Proceedings of ACM SIGIR, 1988.
[37] N.M. Akim, A. Dix, A. Katifori, G. Lepouras, N. Shabir, C. Vassilakis, Spreading activation for web scale reasoning: Promise and problems, [in:] Proceedings of WebSci ’11, June 14–17, 2011, Koblenz, Germany, 2011.
[38] A. Troussov, M. Sogrin, J. Judge, D. Botvich, Mining sociosemantic networks using spreading activation technique, [in:] Proceedings of I-KNOW ’08 and I-MEDIA ’08 Graz, Austria, September 3–5, 2008, pp. 405–412, 2008.
[39] M. Piasecki, R. Kurc, R. Ramocki, B. Broda, Lexical Activation Area Attachment Algorithm for Wordnet Expansion, [in:] Proc. 15th International Conference on Artificial Intelligence: Methodology, Systems, Applications (A. Ramsay and G. Agre, eds.), vol. 7557 of Lecture Notes in Computer Science, pp. 23–31, Springer, 2012.
[40] M. Derwojedowa, S. Szpakowicz, M. Zawisławska, M. Piasecki, Lexical Units as the Centrepiece of a Wordnet, [in:] Proc. 16th Int. Conf. on Intelligent Information Systems (M.A. Kłopotek, A. Przepiórkowski, S.T. Wierzchoń, K. Trojanowski, eds.), pp. 351–358, 2008.
[41] M. Maziarz, M. Piasecki, S. Szpakowicz, The chicken-and-egg problem in wordnet design: synonymy, synsets and constitutive relations, Language Resources and Evaluation, vol. 47, no. 3, pp. 769–796, 2013.
[42] C. Fellbaum, ed., WordNet – An Electronic Lexical Database. The MIT Press, 1998.
[43] Ł. Kłyk, P. Myszkowski, B. Broda, M. Piasecki, D. Urbansky, Metaheuristics for tuning model parameters in two natural language processing applications, [in:] Proceedings of the 15th International Conference on Artificial Intelligence: Methodology, Systems, Applications (A. Ramsay and G. Agre, eds.), vol. 7557 of Lecture Notes in Computer Science, (Varna, Bulgaria), pp. 32–37, Springer, 2012.
[44] B. Broda, R. Kurc, M. Piasecki, R. Ramocki, Evaluation method for automated wordnet expansion, [in:] Security and Intelligent Information Systems (P. Bouvry, M. Kłopotek, F. Leprevost, M. Marciniak, A. Mykowiecka, H. Rybiński, eds.), LNCS, Springer, 2011.
[45] R. Snow, D. Jurafsky, A.Y. Ng, Learning syntactic patterns for automatic hypernym discovery, [in:] NIPS, 2004.
[46] D. Lin, Principle-based parsing without overgeneration, [in:] Proc. ACL-93, Columbus, Ohio, 1993.
[47] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[48] G. Israel, Determining sample size, tech. rep., University of Florida, 1992.
[49] M. Piasecki, S. Szpakowicz, B. Broda, Automatic selection of heterogeneous syntactic features in semantic similarity of Polish nouns, [in:] Text, Speech and Dialogue, 10th International Conference, TSD 2007, Pilsen, Czech Republic, September 3-7, 2007, Proceedings (V. Matousek and P. Mautner, eds.), vol. 4629 of LNCS, pp. 99–106, Springer, 2007.
[50] M. Piasecki, M.M. and Stanisław Szpakowicz, B. Broda, Classification-based filtering of semantic relatedness in hypernymy extraction, [in:] Advances in Natural Language Processing, 6th International Conference, GoTAL 2008, Gothenburg, Sweden, August 25–27, 2008, Proceedings (B. Nordström and A. Ranta, eds.), vol. 5221 of LNCS, pp. 393–404, Springer, 2008.
[51] M.A. Hearst, Automated Discovery of WordNet Relations, ch. 5, pp. 131–151. Vol. 1 of Fellbaum [42], 1998.
[52] R. Kurc and M. Piasecki, Automatic acquisition of wordnet relations by the morpho-syntactic patterns extracted from the corpora in Polish, [in:] Proceedings of the International Multiconference on Computer Science and Information Technology – 3rd International Symposium Advances in Artificial Intelligence and Applications (AAIA’08), pp. 181–188, 2008.
[53] R. Kurc, M. Piasecki, S. Szpakowicz, Automatic acquisition of wordnet relations by distributionally supported morphological patterns extracted from Polish corpora, [in:] Text, Speech and Dialogue, 13th International Conference, TSD 2010, Brno, Czech Republic, September 6–10, 2010. Proceedings (P. Sojka, A. Horák, I. Kopecek, K. Pala, eds.), vol. 6231 of Lecture Notes in Computer Science, pp. 133–141, 2010.

Notes

Record prepared under agreement 509/P-DUN/2018 with funds from the Ministry of Science and Higher Education (MNiSW) allocated to science dissemination activities (2019).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-c49fe7ce-c14e-4af6-b84b-b859ca10f770