Lexicon-based Document Representation

Virginia, G.; Nguyen, H.S.

Article - details

Article title

Lexicon-based Document Representation

Authors

Virginia G., Nguyen H.S.

Selected full texts from this journal

https://fi.episciences.org/

Identifiers

Title variants

Publication languages

Abstracts

Interpreting user queries is a major challenge for an information retrieval system (IRS), particularly because a typical query consists of very few terms. The tolerance rough sets model (TRSM), an extension of rough sets theory, has demonstrated its ability to enrich document representation in terms of semantic relatedness. However, system efficiency is at stake because the weight vector created by TRSM (TRSM-representation) is much less sparse. We mapped the terms occurring in the TRSM-representation to terms in a lexicon, so that the final representation of a document was a weight vector consisting only of terms that occur in the lexicon (LEX-representation). The LEX-representation can be viewed as a compact form of the TRSM-representation in a lower-dimensional space, and it eliminates all informal terms that previously occurred in the TRSM vector. Given these properties, we may expect a more efficient system. We used recall and precision, the measures commonly used in information retrieval, to evaluate the effectiveness of the LEX-representation. Our examination found that the effectiveness of the LEX-representation is comparable to that of the TRSM-representation, while its efficiency should be better. We concluded that lexicon-based document representation is another alternative that could be used to represent a document while taking semantics into account. We are inclined to implement the LEX-representation together with linguistic computation, such as tagging and feature selection, in order to retrieve more relevant terms with high weight. As for the TRSM method, enhancing the quality of the tolerance classes is crucial because the method relies entirely on them. We plan to combine other resources, such as Wikipedia Indonesia, to generate better tolerance classes.
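The two operations named in the abstract, restricting a weight vector to lexicon terms and scoring retrieval with precision and recall, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the mapping to the lexicon is a plain membership filter, and the function names, sample terms, weights, and document identifiers are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def to_lex_representation(trsm_vector: Dict[str, float],
                          lexicon: Set[str]) -> Dict[str, float]:
    """Keep only the weighted terms that also occur in the lexicon.

    Assumption: the mapping is a simple membership test; the paper's
    actual mapping may involve more than this.
    """
    return {term: weight for term, weight in trsm_vector.items()
            if term in lexicon}


def precision_recall(retrieved: Iterable[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    """Set-based precision and recall for a single query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical data: a dense TRSM-style vector with two informal terms.
trsm_vector = {"buku": 0.8, "pustaka": 0.3, "bkn": 0.1, "lol": 0.05}
lexicon = {"buku", "pustaka", "baca"}

lex_vector = to_lex_representation(trsm_vector, lexicon)
precision, recall = precision_recall(["d1", "d2", "d3"],
                                     ["d1", "d3", "d4", "d5"])
```

In this sketch the informal terms `bkn` and `lol` are dropped, so the resulting vector has fewer non-zero entries than the input, which is the property the abstract ties to efficiency.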

Keywords

information retrieval, tolerance rough sets model

Publisher

IOS Press

Journal

Fundamenta Informaticae

Year

2013

Volume

Vol. 124, no. 1/2

Pages

27-46

Physical description

Bibliography: 24 items, charts.

Contributors

author

Virginia G.

virginia@icm.edu.pl

Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland

author

Nguyen H.S.

son@mimuw.edu.pl

Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland

Bibliography

[1] Adriani, M., Asian, J., Nazief, B., Tahaghoghi, S. M. M., Williams, H. E.: Stemming Indonesian: A confix-stripping approach, ACM Transactions on Asian Language Information Processing, 6, December 2007, 1-33, ISSN 1530-0226.
[2] Adriani, M., Nazief, B.: Confix-Stripping: Approach to Stemming Algorithm for Bahasa Indonesia, 1996, Internal publication.
[3] Asian, J.: Effective Techniques for Indonesian Text Retrieval, Ph.D. Thesis, School of Computer Science and Information Technology, RMIT University, March 2007.
[4] Gaoxiang, Y., Heping, H., Zhengding, L., Ruixuan, L.: A Novel Web Query Automatic Expansion Based on Rough Set, Wuhan University Journal of Natural Sciences, 11(5), 2006, 1167-1171.
[5] Hliaoutakis, A., Varelas, G., Voutsakis, E., Petrakis, E. G. M., Milios, E.: Information Retrieval by Semantic Similarity, Intern. Journal on Semantic Web and Information Systems (IJSWIS), 3(3), 2006, 55-73.
[6] Ho, T. B., Nguyen, N. B.: Nonhierarchical Document Clustering Based on a Tolerance Rough Set Model, International Journal of Intelligent Systems, 17(2), February 2002, 199-212.
[7] Janusz, A., Świeboda, W., Krasuski, A., Nguyen, H. S.: Interactive Document Indexing Method Based on Explicit Semantic Analysis, in: Rough Sets and Current Trends in Computing (J. Yao, Y. Yang, R. Słowiński, S. Greco, H. Li, S. Mitra, L. Polkowski, Eds.), vol. 7413 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2012, ISBN 978-3-642-32114-6, 156-165.
[8] Kawasaki, S., Nguyen, N. B., Ho, T. B.: Hierarchical Document Clustering Based on Tolerance Rough Set Model, Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD ’00, Springer-Verlag, London, UK, 2000, ISBN 3-540-41066-X, 458-463.
[9] Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough Sets: A Tutorial, Springer-Verlag, 1998, 3-98.
[10] Lv, Y., Zhai, C.: Adaptive relevance feedback in information retrieval, Proceedings of the 18th ACM conference on Information and knowledge management, CIKM ’09, ACM, New York, NY, USA, 2009, ISBN 978-1-60558-512-3, 255-264.
[11] Lv, Y., Zhai, C.: Positional relevance model for pseudo-relevance feedback, Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, ACM, New York, NY, USA, 2010, ISBN 978-1-4503-0153-4, 579-586.
[12] Manning, C. D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, Cambridge University Press, 2008, ISBN 9780521865715.
[13] McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action, Manning Publications Co., 2010, ISBN 9781933988177.
[14] Nguyen, H. S., Ho, T. B.: Rough Document Clustering and the Internet, chapter 47, John Wiley & Sons Ltd., 2008, 987-1003.
[15] Pawlak, Z.: Rough Sets, International Journal of Computer and Information Science, 11(5), 1982, 341-356.
[16] Pawlak, Z.: Some Issues on Rough Sets, Transactions on Rough Sets I (J. F. Peters, A. Skowron, J. W. Grzymala-Busse, B. Kostek, R. W. Swiniarski, M. S. Szczuka, Eds.), 3100, Springer, 2004, ISBN 3-540-22374-6, 1-58.
[17] Paz-Trillo, C., Wassermann, R., Braga, P. P.: An Information Retrieval Application using Ontologies, Journal of the Brazilian Computer Society, 11(2), 2005, 17-31.
[18] Salton, G., Buckley, C.: Term-Weighting Approaches in Automatic Text Retrieval, Information Processing & Management, 24(5), Aug 1988, 513-523, ISSN 0306-4573.
[19] Skowron, A., Stepaniuk, J.: Tolerance Approximation Spaces, Fundam. Inf., 27, August 1996, 245-253, ISSN 0169-2968.
[20] Takale, S. A., Nandgaonkar, S. S.: Measuring semantic similarity between words using web search engines, Proceedings of the 16th international conference on World Wide Web, WWW ’07, ACM, New York, NY, USA, 2007, ISBN 978-1-59593-654-7, 757-766.
[21] Vega, V. B.: Information Retrieval for the Indonesian Language, Master Thesis, National University of Singapore, 2001, Unpublished.
[22] Virginia, G., Nguyen, H. S.: Automatic Ontology Constructor for Indonesian Language, Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03, WI-IAT ’10, IEEE Computer Society, Washington, DC, USA, 2010, ISBN 978-0-7695-4191-4, 440-443.
[23] Virginia, G., Nguyen, H. S.: Investigating the Effectiveness of Thesaurus Generated Using Tolerance Rough Set Model, Proceedings of the 19th International Conference on Foundations of Intelligent Systems, ISMIS ’11, Springer-Verlag, Berlin, Heidelberg, 2011, ISBN 978-3-642-21915-3, 705-714.
[24] Voorhees, E. M., Harman, D.: Overview of the Ninth Text REtrieval Conference (TREC-9), Proceedings of the Ninth Text REtrieval Conference (TREC-9), National Institute of Standards and Technology (NIST), 2000, 1-14.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-d757c223-f87f-45a6-bba5-450ff7288418