Article title

Evaluation of automatic updates of Roget’s Thesaurus

Publication languages
EN
Abstracts
EN
Thesauri and similarly organised resources attract increasing interest of Natural Language Processing researchers. Thesauri age fast, so there is a constant need to update their vocabulary. Since a manual update cycle takes considerable time, automated methods are required. This work presents a tuneable method of measuring semantic relatedness, trained on Roget’s Thesaurus, which generates lists of terms related to words not yet in the Thesaurus. Using these lists of terms, we experiment with three methods of adding words to the Thesaurus. We add, with high confidence, over 5500 and 9600 new word senses to versions of Roget’s Thesaurus from 1911 and 1987 respectively. We evaluate our work both manually and by applying the updated thesauri in three NLP tasks: selection of the best synonym from a set of candidates, pseudo-word-sense disambiguation and SAT-style analogy problems. We find that the newly added words are of high quality. The additions significantly improve the performance of Roget’s-based methods in these NLP tasks. The performance of our system compares favourably with that of WordNet-based methods. Our methods are general enough to work with different versions of Roget’s Thesaurus.
Year
Pages
1-49
Physical description
Bibliography: 64 items, figures, tables, charts.
Authors
author
  • School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada
  • School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada
  • Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Bibliography
  • [1] Satanjeev Banerjee and Ted Pedersen (2002), An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet, in Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics – CICLing 2002, pp. 136-145, Mexico City, Mexico.
  • [2] Patrick J. Cassidy (2000), An Investigation of the Semantic Relations in the Roget’s Thesaurus: Preliminary Results, in Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics – CICLing 2000, pp. 181-204, Mexico City, Mexico.
  • [3] Stephen Clark and David Weir (2002), Class-based Probability Estimation Using a Semantic Hierarchy, Computational Linguistics, 28 (2): 187-206.
  • [4] Carolyn J. Crouch (1988), A Cluster-based Approach to Thesaurus Construction, in Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval – SIGIR 1988, pp. 309-320, Grenoble, France.
  • [5] Carolyn J. Crouch and Bokyung Yang (1992), Experiments in Automatic Statistical Thesaurus Construction, in Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval – SIGIR 1992, pp. 77-88, Copenhagen, Denmark.
  • [6] Ido Dagan, Lillian Lee, and Fernando Pereira (1999), Similarity-based Models of Word Co-occurrence Probabilities, Machine Learning, 34 (1-3): 43-69.
  • [7] Andrea Esuli and Fabrizio Sebastiani (2006), SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining, in Proceedings of the 5th Conference on Language Resources and Evaluation – LREC 2006, pp. 417-422, Genoa, Italy.
  • [8] Christiane Fellbaum, editor (1998), WordNet: an Electronic Lexical Database, MIT Press, Cambridge, MA, USA.
  • [9] William A. Gale, Kenneth W. Church, and David Yarowsky (1992), A Method for Disambiguating Word Senses in a Large Corpus, Computers and the Humanities, 26: 415-439.
  • [10] Roxana Girju, Adriana Badulescu, and Dan Moldovan (2003), Learning Semantic Constraints for the Automatic Discovery of Part-Whole Relations, in Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics – HLT-NAACL 2003, pp. 1-8, Edmonton, Canada.
  • [11] Roxana Girju, Adriana Badulescu, and Dan Moldovan (2006), Automatic Discovery of Part-Whole Relations, Computational Linguistics, 32 (1): 83-136.
  • [12] Marti A. Hearst (1992), Automatic Acquisition of Hyponyms from Large Text Corpora, in Proceedings of the 14th International Conference on Computational Linguistics – COLING 1992, pp. 539-545, Nantes, France.
  • [13] Graeme Hirst and David St-Onge (1998), Lexical Chains as Representation of Context for the Detection and Correction of Malapropisms, in Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database, pp. 305-322, MIT Press, Cambridge, MA, USA.
  • [14] Mario Jarmasz (2003), Roget’s Thesaurus as a Lexical Resource for Natural Language Processing, Master’s thesis, University of Ottawa, Canada.
  • [15] Mario Jarmasz and Stan Szpakowicz (2003a), Not as Easy As It Seems: Automating the Construction of Lexical Chains Using Roget’s Thesaurus, in 16th Conference of the Canadian Society for Computational Studies of Intelligence – AI 2003, Halifax, Canada, number 2671 in Lecture Notes in Computer Science, pp. 544-549, Springer, Berlin/Heidelberg, Germany.
  • [16] Mario Jarmasz and Stan Szpakowicz (2003b), Roget’s Thesaurus and Semantic Similarity, in Proceedings of the Conference on Recent Advances in Natural Language Processing – RANLP 2003, pp. 212-219, Borovets, Bulgaria.
  • [17] Mario Jarmasz and Stan Szpakowicz (2004), Roget’s Thesaurus and Semantic Similarity, in N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing III: Selected Papers from RANLP 2003, volume 260 of Current Issues in Linguistic Theory, pp. 111-120, John Benjamins, Amsterdam, The Netherlands/Philadelphia, PA, USA.
  • [18] Jay J. Jiang and David W. Conrath (1997), Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy, in Proceedings of the 10th International Conference on Research on Computational Linguistics – ROCLING X, pp. 19-33, Taipei, Taiwan.
  • [19] Joshua C. Kendall (2008), The Man Who Made Lists: Love, Death, Madness, and the Creation of Roget’s Thesaurus, G. P. Putnam’s Sons, New York, NY, USA.
  • [20] Alistair Kennedy and Graeme Hirst (2012), Measuring Semantic Relatedness Across Languages, in xLiTe: Cross-Lingual Technologies, workshop collocated with the Conference on Neural Information Processing Systems – NIPS 2012, Lake Tahoe, NV, USA.
  • [21] Alistair Kennedy and Stan Szpakowicz (2007), Disambiguating Hypernym Relations for Roget’s Thesaurus, in Proceedings of the 10th International Conference on Text, Speech and Dialogue – TSD 2007, Pilsen, Czech Republic, number 4629 in Lecture Notes in Artificial Intelligence, pp. 66-75, Springer, Berlin/Heidelberg, Germany.
  • [22] Alistair Kennedy and Stan Szpakowicz (2011), A Supervised Method of Feature Weighting for Measuring Semantic Relatedness, in Proceedings of the Canadian Conference on Artificial Intelligence – AI 2011, St. John’s, Canada, number 6657 in Lecture Notes in Artificial Intelligence, pp. 222-233, Springer, Berlin/Heidelberg, Germany.
  • [23] Alistair Kennedy and Stan Szpakowicz (2012), Supervised Distributional Semantic Relatedness, in Proceedings of the 15th International Conference on Text, Speech and Dialogue – TSD 2012, Brno, Czech Republic, number 7499 in Lecture Notes in Artificial Intelligence, pp. 207-214, Springer, Berlin/Heidelberg, Germany.
  • [24] Betty Kirkpatrick, editor (1987), Roget’s Thesaurus of English Words and Phrases, Longman, Harlow, UK.
  • [25] Klaus Krippendorff (2004), Content Analysis: An Introduction to Its Methodology, Sage Publications Inc., Los Angeles, CA, USA, 2nd edition.
  • [26] Oi Yee Kwong (1998a), Aligning WordNet with Additional Lexical Resources, in Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, pp. 73-79, Montréal, Canada.
  • [27] Oi Yee Kwong (1998b), Bridging the Gap Between Dictionary and Thesaurus, in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics – ACL 1998, pp. 1487-1489, Montréal, Canada.
  • [28] J. Richard Landis and Gary G. Koch (1977), The Measurement of Observer Agreement for Categorical Data, Biometrics, 33: 159-174.
  • [29] Claudia Leacock and Martin Chodorow (1998), Combining Local Context and WordNet Sense Similarity for Word Sense Disambiguation, in Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database, pp. 265-284, MIT Press, Cambridge, MA, USA.
  • [30] Lillian Lee (1999), Measures of Distributional Similarity, in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics – ACL 1999, pp. 25-32, College Park, MD, USA.
  • [31] Lothar Lemnitzer, Holger Wunsch, and Piklu Gupta (2008), Enriching GermaNet with Verb-Noun Relations – a Case Study of Lexical Acquisition, in Proceedings of the 6th International Conference on Language Resources and Evaluation – LREC 2008, Marrakech, Morocco.
  • [32] Dekang Lin (1998a), Automatic Retrieval and Clustering of Similar Words, in Proceedings of the 17th International Conference on Computational Linguistics – COLING 1998, pp. 768-774, Montréal, Canada.
  • [33] Dekang Lin (1998b), Dependency-Based Evaluation of MINIPAR, in Proceedings of the Workshop on the Evaluation of Parsing Systems at the 1st International Conference on Language Resources and Evaluation – LREC 1998, Granada, Spain.
  • [34] Bernardo Magnini and Gabriela Cavagliá (2000), Integrating Subject Field Codes into WordNet, in Proceedings of the 2nd International Conference on Language Resources and Evaluation – LREC 2000, pp. 1413-1418, Athens, Greece.
  • [35] Jeff Mitchell and Mirella Lapata (2008), Vector-based models of semantic composition, in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – ACL 2008: HLT, pp. 236-244, Columbus, OH, USA.
  • [36] Emmanuel Morin and Christian Jacquemin (1999), Projecting Corpus-Based Semantic Links on a Thesaurus, in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics – ACL 1999, pp. 389-396, College Park, MD, USA.
  • [37] Jane Morris and Graeme Hirst (1991), Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text, Computational Linguistics, 17 (1): 21-48.
  • [38] Vivi Nastase and Stan Szpakowicz (2001), Word Sense Disambiguation in Roget’s Thesaurus Using WordNet, in Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, pp. 12-22, Pittsburgh, PA, USA.
  • [39] Patrick Pantel (2003), Clustering by Committee, Ph.D. thesis, University of Alberta, Canada.
  • [40] Patrick Pantel (2005), Inducing Ontological Co-occurrence Vectors, in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics – ACL 2005, pp. 125-132, Ann Arbor, MI, USA.
  • [41] Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen (2003), Using Measures of Semantic Relatedness for Word Sense Disambiguation, in Proceedings of the 4th International Conference on Intelligent Text Processing and Computational Linguistics – CICLing-2003, pp. 241-257, Mexico City, Mexico.
  • [42] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi (2004), WordNet::Similarity - Measuring the Relatedness of Concepts, in Proceedings of the 19th National Conference on Artificial Intelligence – AAAI 2004, pp. 1024-1025, San Jose, CA, USA.
  • [43] Maciej Piasecki, Bartosz Broda, Michał Marcińczuk, and Stan Szpakowicz (2009), The WordNet Weaver: Multi-criteria Voting for Semi-automatic Extension of a Wordnet, in Proceedings of the 22nd Canadian Conference on Artificial Intelligence – AI 2009, Kelowna, Canada, number 5549 in Lecture Notes in Artificial Intelligence, pp. 237-240, Springer, Berlin/Heidelberg, Germany.
  • [44] Paul Procter (1978), Longman Dictionary of Contemporary English, Longman Group Ltd., Essex, UK.
  • [45] Philip Resnik (1995), Using Information Content to Evaluate Semantic Similarity, in Proceedings of the 14th International Joint Conference on Artificial Intelligence – IJCAI 1995, pp. 448-453, Montréal, Canada.
  • [46] Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil (1999), Inducing a Semantically Annotated Lexicon via EM-based Clustering, in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics – ACL 1999, pp. 104-111, College Park, MD, USA.
  • [47] Sara Rydin (2002), Building a Hyponymy Lexicon with Hierarchical Structure, in Proceedings of the ACL-02/SIGLEX Workshop on Unsupervised Lexical Acquisition – ULA 2002, pp. 26-33, Philadelphia, PA, USA.
  • [48] Benoît Sagot and Darja Fišer (2011), Extending wordnets by learning from multiple resources, in Proceedings of the 5th Language and Technology Conference – LTC 2011, pp. 526-530, Poznań, Poland.
  • [49] Erik Tjong Kim Sang (2007), Extracting Hypernym Pairs from the Web, in Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics – ACL 2007 (Interactive Poster and Demonstration Sessions), pp. 165-168, Prague, Czech Republic.
  • [50] Hinrich Schütze (1998), Automatic Word Sense Discrimination, Computational Linguistics, 24 (1): 97-123.
  • [51] Keiji Shinzato and Kentaro Torisawa (2004), Acquiring Hyponymy Relations from Web Documents, in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics – HLT-NAACL 2004, pp. 73-80, Boston, MA, USA.
  • [52] Rion Snow, Daniel Jurafsky, and Andrew Y. Ng (2005), Learning Syntactic Patterns for Automatic Hypernym Discovery, in Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pp. 1297-1304, MIT Press, Cambridge, MA, USA.
  • [53] Rion Snow, Daniel Jurafsky, and Andrew Y. Ng (2006), Semantic Taxonomy Induction from Heterogenous Evidence, in Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics – COLING/ACL 2006, Sydney, Australia.
  • [54] Ratanachai Sombatsrisomboon, Yutaka Matsuo, and Mitsuru Ishizuka (2003), Acquisition of Hypernyms and Hyponyms from the WWW, in Proceedings of the 2nd International Workshop on Active Mining – AM 2003 (in Conjunction with the International Symposium on Methodologies for Intelligent Systems), pp. 7-13, Maebashi City, Japan.
  • [55] Carlo Strapparava and Alessandro Valitutti (2004), WordNet-Affect: an Affective Extension of WordNet, in Proceedings of the 4th International Conference on Language Resources and Evaluation – LREC 2004, pp. 1083-1086, Lisbon, Portugal.
  • [56] Hiroaki Tsurumaru, Toru Hitaka, and Sho Yoshida (1986), An Attempt to Automatic Thesaurus Construction from an Ordinary Japanese Language Dictionary, in Proceedings of the 11th Conference on Computational Linguistics – COLING 1986, pp. 445-447, Bonn, Germany.
  • [57] Peter Turney (2005), Measuring Semantic Similarity by Latent Relational Analysis, in Proceedings of the 19th International Joint Conference on Artificial Intelligence – IJCAI-05, pp. 1136-1141, Edinburgh, Scotland.
  • [58] Peter Turney (2006), Similarity of Semantic Relations, Computational Linguistics, 32 (3): 379-416.
  • [59] Peter Turney (2012), Domain and Function: A Dual-Space Model of Semantic Relations and Compositions, Journal of Artificial Intelligence Research, 44: 533-585.
  • [60] Piek Vossen, editor (1998), EuroWordNet: a Multilingual Database with Lexical Semantic Networks, Kluwer Academic Publishers, Norwell, MA, USA.
  • [61] Julie Weeds and David Weir (2005), Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity, Computational Linguistics, 31 (4): 439-475.
  • [62] Zhibiao Wu and Martha Palmer (1994), Verb Semantics and Lexical Selection, in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics – ACL 1994, pp. 133-138, Las Cruces, NM, USA.
  • [63] Hao Zheng, Xian Wu, and Yong Yu (2008), Enriching WordNet with Folksonomies, in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining – PAKDD 2008, number 5012 in Lecture Notes in Artificial Intelligence, pp. 1075-1080, Springer, Berlin/Heidelberg, Germany.
  • [64] Maayan Zhitomirsky-Geffet and Ido Dagan (2009), Bootstrapping distributional feature vector quality, Computational Linguistics, 35 (3): 435-461.
Notes
Record prepared with funds from MNiSW (the Polish Ministry of Science and Higher Education), agreement No. 461252, under the programme "Social Responsibility of Science", module: Popularisation of science and promotion of sport (2020).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-35ce6e91-b898-4ca8-bb2c-8302f0b3af81