Methods of Text Document Classification (Metody klasyfikacji dokumentów tekstowych)

Chrabąszcz, M.; Gołębski, M.; Bembenik, R.

Article details

Abstract

The article addresses the automatic categorization of text documents. The first part presents the algorithms most commonly used in this field: Naive Bayes, Rocchio, KNN, and SVM. It also discusses methods for combining classifiers, including bagging and boosting, as well as the AdaBoost algorithm. The second part covers ways of constructing document representations. The article briefly presents the problems involved in selecting the attributes used for document classification, and methods for building vector representations.

Keywords

document classification; document classification algorithms; construction of text document representations; machine learning

Publisher

Wydawnictwo Politechniki Częstochowskiej

Journal

Informatyka Teoretyczna i Stosowana

Year

2002

Volume

Vol. 2, no. 3

Pages

89-100

Physical description

Bibliography: 34 items, 1 figure.

Contributors

autor

Chrabąszcz M.

Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warszawa

autor

Gołębski M.

Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warszawa

autor

Bembenik R.

Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warszawa

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BPG4-0015-0008