Extraction of Polish noun senses from large corpora by means of clustering

Broda, B.; Piasecki, M.; Szpakowicz, S.

Article details

Content

Full texts:

http://matwbn.icm.edu.pl/ksiazki/cc/cc39/cc3926.pdf [remote]

Abstract

We investigate two methods of identifying noun senses, based on clustering of lemmas and of documents. We have adapted the well-known Clustering by Committee algorithm to Polish, and tested it on very large Polish corpora. The evaluation, by means of a WordNet-based synonymy test, used the Polish wordnet (plWordNet 1.0). Various clustering algorithms were analysed for the extraction of document clusters as indicators of the senses of the words which occur in them. The two approaches to word sense identification have been compared, and conclusions drawn.

Keywords

corpus linguistics; semantic similarity; Polish nouns; word clustering; Clustering by Committee; co-occurrence retrieval models; rank weight function; Polish WordNet; WordNet-based synonymy test; document clustering; keywords extraction

Publisher

Systems Research Institute, Polish Academy of Sciences

Journal

Control and Cybernetics

Year

2010

Volume

Vol. 39, no. 2

Pages

401-420

Physical description

Bibliography: 31 items.

Authors

author

Broda B.

author

Piasecki M.

author

Szpakowicz S.

Institute of Informatics, Wroclaw University of Technology, Poland

Bibliography

AGIRRE, E. and EDMONDS, P., eds. (2006) Word Sense Disambiguation: Algorithms and Applications (Text, Speech and Language Technology). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
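BRODA, B. (2007) Mechanizmy grupowania dokumentów w automatycznej ekstrakcji sieci semantycznych dla języka polskiego (Document clustering mechanisms in the automatic extraction of semantic networks for Polish; in Polish).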
BRODA, B. and PIASECKI, M. (2008) Experiments in Documents Clustering for the Automatic Acquisition of Lexical Semantic Networks for Polish. In: M.A. Kłopotek, A. Przepiórkowski, S.T. Wierzchoń and K. Trojanowski, eds., Proceedings of the 16th International Conference Intelligent Information Systems. EXIT, Warsaw, 203-212.
BRODA, B., DERWOJEDOWA, M., PIASECKI, M. and SZPAKOWICZ, S. (2008a) Corpus-based Semantic Relatedness for the Construction of Polish WordNet. In: Proc. 6th Language Resources and Evaluation Conference (LREC'08), 1800-1807.
BRODA, B., PIASECKI, M. and SZPAKOWICZ, S. (2008b) Sense-Based Clustering of Polish Nouns in the Extraction of Semantic Relatedness. In: Proceedings of the International Multiconference on Computer Science and Information Technology - 2nd International Symposium Advances in Artificial Intelligence and Applications (AAIA'08), 83-89.
BRODA, B., PIASECKI, M. and SZPAKOWICZ, S. (2009) Rank-based transformation in measuring semantic relatedness. In: Y. Gao and N. Japkowicz, eds., Canadian Conference on AI. LNCS 5549, Springer, 187-190.
DERWOJEDOWA, M., PIASECKI, M., SZPAKOWICZ, S. and ZAWISŁAWSKA, M. (2007) Polish WordNet on a Shoestring. In: Proceedings of Biannual Conference of the Society for Computational Linguistics and Language Technology, Tübingen, April 11-13 2007, Universität Tübingen, 169-178.
DERWOJEDOWA, M., PIASECKI, M., SZPAKOWICZ, S., ZAWISŁAWSKA, M. and BRODA, B. (2008) Words, Concepts and Relations in the Construction of Polish WordNet. In: A. Tanacs, D. Csendes, V. Vincze, C. Fellbaum and P. Vossen, eds., Proc. Global WordNet Conference, Szeged, Hungary January 22-25 2008, University of Szeged, 162-177.
FELLBAUM, C., ed. (1998) WordNet - An Electronic Lexical Database. The MIT Press.
FORSTER, R. (2006) Document Clustering in Large German Corpora Using Natural Language Processing. PhD thesis, University of Zurich.
FREITAG, D., BLUME, M., BYRNES, J., CHOW, E., KAPADIA, S., ROHWER, R. and WANG, Z. (2005) New Experiments in Distributional Representations of Synonymy. In: Proc. Ninth Conference on Computational Natural Language Learning (CoNLL-2005), Ann Arbor, Michigan, Association for Computational Linguistics, 25-32.
GUHA, S., RASTOGI, R. and SHIM, K. (2000) ROCK: A Robust Clustering Algorithm for Categorical Attributes. Information Systems 25 (5), 345-366.
HARRIS, Z.S. (1968) Mathematical Structures of Language. Interscience Publishers, New York.
INDYKA-PIASECKA, A. (2004) Modele użytkownika w internetowych systemach wyszukiwania informacji (User models in the web information retrieval systems; in Polish). PhD thesis, Politechnika Wrocławska.
KARYPIS, G. (2002) CLUTO: a clustering toolkit. Technical Report 02-017, Department of Computer Science, University of Minnesota. URL http://www.cs.umn.edu/~cluto.
KOHONEN, T., KASKI, S., LAGUS, K., SALOJÄRVI, J., HONKELA, J., PAATERO, V. and SAARELA, A. (2000) Self-organization of a massive document collection. IEEE Transactions on Neural Networks, 11, 574-585.
LANDAUER, T.K. and DUMAIS, S.T. (1997) A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition. Psychological Review 104 (2), 211-240.
LI, H. (1998) A Probabilistic Approach to Lexical Semantic Knowledge Acquisition and Structural Disambiguation. PhD thesis, Graduate School of Science of the University of Tokyo.
MATSUO, Y. and ISHIZUKA, M. (2004) Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 13 (1), 157-169.
PANTEL, P. (2003) Clustering by committee. PhD thesis, University of Alberta, Computing Science, Edmonton, Alta., Canada.
PANTEL, P. and LIN, D. (2002) Discovering Word Senses from Text. In: Proc. ACM Conference on Knowledge Discovery and Data Mining (KDD-02), Edmonton, Canada, 613-619.
PANTEL, P. and PENNACCHIOTTI, M. (2006) Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. In: Proc. 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 113-120. URL http://www.aclweb.org/anthology/P/P06/P06-1015.
PEDERSEN, T. (2006) Unsupervised Corpus Based Methods for WSD. In: E. Agirre and P. Edmonds, eds., Word Sense Disambiguation: Algorithms and Applications (Text, Speech and Language Technology). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 133-166.
PIASECKI, M. (2006) Handmade and Automatic Rules for Polish Tagger. In: P. Sojka, I. Kopecek and K. Pala, eds., Proc. Text, Speech and Dialog 2006 Conference, LNAI 4188, Springer, 205-212.
PIASECKI, M., SZPAKOWICZ, S. and BRODA, B. (2007a) Extended Similarity Test for the Evaluation of Semantic Similarity Functions. In: Z. Vetulani, ed., Proc. 3rd Language and Technology Conference, October 5-7, 2007, Poznań, Poland, Wydawnictwo Poznańskie Sp. z o.o., Poznań, 104-108.
PIASECKI, M., SZPAKOWICZ, S. and BRODA, B. (2007b) Automatic Selection of Heterogeneous Syntactic Features in Semantic Similarity of Polish Nouns. In: Proc. Text, Speech and Dialog 2007 Conference. LNAI 4629, Springer.
PRZEPIÓRKOWSKI, A. (2004) The IPI PAN Corpus: Preliminary version. Institute of Computer Science PAS.
RAUBER, A., MERKL, D. and DITTENBACH, M. (2002) The growing hierarchical self-organizing maps: exploratory analysis of high-dimensional data. IEEE Transactions on Neural Networks 13 (6), 1331-1341.
SALTON, G. and McGILL, M.J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill Inc.
TOMURO, N., LYTINEN, S.L., KANZAKI, K. and ISAHARA, H. (2007) Clustering Using Feature Domain Similarity to Discover Word Senses for Adjectives. In: Proc. 1st IEEE International Conference on Semantic Computing (ICSC-2007), IEEE, 370-377.
WEISS, D. (2008) Korpus Rzeczpospolitej. [on-line] http://www.cs.put.poznan.pl/dweiss/rzeczpospolita. Corpus of text from the online edition of Rzeczpospolita.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BAT5-0055-0009