A Review of Artificial Intelligence Algorithms in Document Classification

Bilski, A.

Abstract

With the evolution of the Internet, the importance and accessibility of text documents and electronic information have increased. Automatic text categorization methods have become essential to information organization and data mining. Proper classification of electronic documents, Internet content, blogs, emails, and digital libraries requires data mining and machine learning algorithms to retrieve the desired data. This paper describes the most important techniques and methodologies used for text classification. The advantages and effectiveness of contemporary algorithms are compared, and their most notable applications are presented.

Keywords

classifier; text classification; data mining; information retrieval; machine learning algorithms

Publisher

Polish Academy of Sciences, Committee of Electronics and Telecommunication

Journal

International Journal of Electronics and Telecommunications

Year

2011

Volume

Vol. 57, No. 3

Pages

263-270

Physical description

Bibliography: 53 items; charts.

Contributors

Author

Bilski A.

Department of Applied Informatics, Warsaw University of Life Sciences, Nowoursynowska 159, 02-767 Warsaw, Poland, blindman26@o2.pl

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BWAK-0026-0004