Classification of text documents by using expanded terms in Latent Semantic Analysis

Śmiałkowska, B.; Gibert, M.

Article - details

Content

Full texts:

Smialkowska_Gilbert_Classification_of_Text_Documents.pdf

Title variants

Klasyfikacja dokumentów tekstowych przy użyciu rozbudowanych wyrażeń w niejawnej analizie semantycznej

Abstracts

This article addresses the quality of text document classification. Common techniques for analysing text documents in classification are presented and their weaknesses are highlighted. The integration of quantitative and qualitative methods, which improves classification quality, is discussed. In the proposed approach, expanded terms obtained by means of information patterns are used in Latent Semantic Analysis. Finally, empirical research is presented and, based on quality measures for text document classification, the effectiveness of the proposed approach is confirmed.

The article focuses on improving the quality of text document classification. The most popular techniques for analysing text documents used in classification are outlined, and attention is drawn to their weaknesses. The possibility of integrating quantitative and qualitative methods of text analysis, and its effect on classification quality, is discussed. A solution is proposed in which expanded terms obtained by means of information patterns are used in Latent Semantic Analysis. Finally, test results based on quality measures for text document classification are presented; they confirm the effectiveness of the proposed solution.

Keywords

text classification; information extraction; Latent Semantic Analysis; information retrieval; text representation

Publisher

Instytut Informatyki Teoretycznej i Stosowanej Polskiej Akademii Nauk

Journal

Theoretical and Applied Informatics

Year

2013

Volume

Vol. 25, No. 3-4

Pages

239-250

Physical description

Bibliography: 14 items, figures.

Authors

author

Śmiałkowska B.

bsmialkowska@wi.zut.edu.pl

Faculty of Computer Science, West Pomeranian University of Technology, ul. Żołnierska 49, Szczecin, Poland

author

Gibert M.

mgibert@wi.zut.edu.pl

Faculty of Computer Science, West Pomeranian University of Technology, ul. Żołnierska 49, Szczecin, Poland

Bibliography

[1] Jackson P. and Moulinier I., Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. John Benjamins Publishing, 2002.
[2] Lewis D. D., Representation quality in text classification: an introduction and experiment, in Proceedings of the workshop on Speech and Natural Language, Stroudsburg, PA, USA, 1990, pp. 288-295.
[3] Sebastiani F., Machine learning in automated text categorization, ACM Comput Surv, vol. 34, no. 1, pp. 1-47, Mar. 2002.
[4] Metzler D., Dumais S., and Meek C., Similarity Measures for Short Segments of Text, in Advances in Information Retrieval, G. Amati, C. Carpineto, and G. Romano, Eds. Springer Berlin Heidelberg, 2007, pp. 16-27.
[5] Stefano Ferilli M. B., Combining Qualitative and Quantitative Keyword Extraction Methods with Document Layout Analysis, pp. 22-33, 2009.
[6] Salton G., Wong A., and Yang C. S., A vector space model for automatic indexing, Commun ACM, vol. 18, no. 11, pp. 613-620, Nov. 1975.
[7] Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., and Harshman R., Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci., vol. 41, no. 6, pp. 391-407, 1990.
[8] Hayes P. J. and Weinstein S. P., CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories, in Proceedings of the Second Conference on Innovative Applications of Artificial Intelligence, 1991, pp. 49-64.
[9] Lubaszewski W., Słowniki komputerowe i automatyczna ekstrakcja informacji z tekstu. AGH Uczelniane Wydawnictwa Naukowo-Dydaktyczne, 2009.
[10] Xia T. and Chai Y., An improvement to TF-IDF: Term Distribution based Term Weight Algorithm, J. Softw., vol. 6, no. 3, Mar. 2011.
[11] Landauer T., Foltz P., and Laham D., An Introduction to Latent Semantic Analysis, Discourse Process., no. 25, pp. 259-284, 1998.
[12] Guo G., Wang H., Bell D., Bi Y., and Greer K., Using kNN model for automatic text categorization, Soft Comput., vol. 10, no. 5, pp. 423-430, Mar. 2006.
[13] Raghavan V., Bollmann P., and Jung G. S., A critical investigation of recall and precision as measures of retrieval system performance, ACM Trans Inf Syst, vol. 7, no. 3, pp. 205-229, Jul. 1989.
[14] Berry M. W., Kogan J., and SIAM International Conference on Data Mining, Text mining applications and theory. Chichester, U.K.: Wiley, 2010.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-7266afcd-fa7a-4b8c-8c9e-d9eea5e85fcf