A Similarity Based Supervised Decision Rule for Qualitative Improvement of Text Categorization

Basu, T.; Murthy, C. A.

doi:10.3233/FI-2015-1276


Article - details

Article title

A Similarity Based Supervised Decision Rule for Qualitative Improvement of Text Categorization

Authors

Basu T., Murthy C. A.

Selected full texts from this journal

https://fi.episciences.org/

Identifiers

DOI

10.3233/FI-2015-1276

Title variants

Publication languages

Abstracts

This article proposes a document similarity based supervised decision rule for text categorization. The rule computes the similarity between a new test document and the existing documents of the training set, which belong to various categories, and assigns the new document to the category in which it has the maximum number of similar documents. The similarity measure determines the similarity between two documents from their distances to all the documents of the training set, and it can explicitly identify two dissimilar documents. The decision rule assigns a test document to the best of the competing categories only if the best category beats the next competing category by a previously fixed margin, which enhances the certainty of the decision. The salient feature of the rule is that it never assigns a document arbitrarily to a category when the decision is not sufficiently certain. The performance of the proposed rule is compared with well known classification techniques, such as the k-nearest neighbor decision rule, support vector machines, and naive Bayes, on various TREC and Reuters corpora. The empirical results show that the proposed method performs significantly better than the other classifiers for text categorization.
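
The decision rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: cosine similarity over term-count dictionaries stands in for the paper's own similarity measure, and the names `cosine`, `classify`, `sim_threshold`, and `margin`, as well as the reading of the margin as a difference in counts of similar documents, are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts.

    Assumption: this is a stand-in for the paper's similarity measure,
    which is instead derived from distances to all training documents.
    """
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(test_doc, training, sim_threshold=0.4, margin=1):
    """Return the winning category, or None when the decision is uncertain.

    training      -- list of (term-weight dict, category label) pairs
    sim_threshold -- hypothetical cutoff above which two documents count as similar
    margin        -- hypothetical minimum lead (in similar-document counts)
                     of the best category over the runner-up
    """
    # Count, per category, the training documents similar to the test document.
    votes = Counter(
        label for vec, label in training if cosine(test_doc, vec) >= sim_threshold
    )
    ranked = votes.most_common(2)
    if not ranked:
        return None  # no similar documents at all: abstain
    best_label, best_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0
    # Assign only if the best category beats the next one by the fixed margin.
    return best_label if best_count - runner_up_count >= margin else None


# Toy training set (hypothetical data, for illustration only).
training = [
    ({"goal": 1, "match": 1}, "sport"),
    ({"goal": 1, "team": 1}, "sport"),
    ({"stock": 1, "market": 1}, "finance"),
    ({"market": 1, "goal": 1}, "finance"),
]
test_doc = {"goal": 1, "match": 1}

decision_small_margin = classify(test_doc, training, margin=1)
decision_large_margin = classify(test_doc, training, margin=2)
```

With the toy data, the test document has two similar "sport" documents and one similar "finance" document, so a margin of 1 yields a decision while a margin of 2 makes the rule abstain rather than assign a category arbitrarily.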

Keywords

document similarity; text categorization; decision rule; text mining

Publisher

IOS Press

Journal

Fundamenta Informaticae

Year

2015

Volume

Vol. 141, no. 4

Pages

275–295

Physical description

Bibliography: 33 items, tables.

Contributors

author

Basu T.

mailtanmaybasu@gmail.com

Machine Intelligence Unit, Indian Statistical Institute, India

author

Murthy C. A.

Machine Intelligence Unit, Indian Statistical Institute, India

Bibliography

[1] Al-Mubaid, H., Umair, S. A.: A New Text Categorization Technique Using Distributional Clustering and Learning Logic, IEEE Transactions on Knowledge and Data Engineering, 18(9), 2006, 1156–1165.
[2] Bailey, T., Jain, A.: A Note on Distance Weighted K Nearest Neighbor Rule, IEEE Transactions on Systems, Man, Cybernetics, 8, 1978, 311–313.
[3] Baoli, L., Qin, L., Shiwen, Y.: An Adaptive k-Nearest Neighbor Text Categorization Strategy, ACM Transactions on Asian Language Information Processing, 3(4), 2004, 215–226.
[4] Basu, T., Murthy, C.: CUES: A New Hierarchical Approach for Document Clustering, Journal of Pattern Recognition Research, 8(1), 2013, 66–84.
[5] Basu, T., Murthy, C.: Towards Enriching the Quality of k-Nearest Neighbor Rule for Document Classification, International Journal of Machine Learning and Cybernetics, 5(6), 2014, 897–905.
[6] Basu, T., Murthy, C., Chakraborty, H.: A Tweak on K-Nearest Neighbor Decision Rule, Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition, USA, 2012.
[7] Basu, T., Murthy, C. A.: Effective Text Classification by a Supervised Feature Selection Approach, Proceedings of the IEEE International Conference on Data Mining Workshops, Belgium, 2012.
[8] Basu, T., Murthy, C. A.: A Similarity Assessment Technique for Effective Grouping of Documents, Information Sciences, 311, 2015, 149–162.
[9] Boley, D., Gini, M., Gross, R., Han, E. H., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., Moore, J.: Document Categorization and Query Generation on the World Wide Web Using WebACE, Journal of Artificial Intelligence Review - Special Issue on Data Mining on The Internet, 3(5-6), 1999, 365–391.
[10] Burges, C. J. C.: A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, 1998, 121–167.
[11] Cai, D., He, X., Han, J.: Document Clustering Using Locality Preserving Indexing, IEEE Transactions on Knowledge and Data Engineering, 17(12), 2005.
[12] Chang, C., Lin, C. J.: LIBSVM: A Library for Support Vector Machines, ACM Transactions on Intelligent Systems and Technology, 2(3), 2011, 1–27.
[13] Chen, L., Guo, G., Wang, K.: Class Dependent Projection Based Method for Text Categorization, Pattern Recognition Letters, 32(11), 2011, 1493–1501.
[14] Dasarathy, B. V.: Nearest Neighbor NN Norms: NN Pattern Classification Techniques, McGraw-Hill Computer Science Series. IEEE CS Press, 1991.
[15] Dhurandhar, A., Dobra, A.: Probabilistic Characterization of Nearest Neighbor Classifiers, International Journal of Machine Learning and Cybernetics, 4(4), 2012, 259–272.
[16] Dudani, S. A.: The Distance Weighted K Nearest Neighbor Rule, IEEE Transactions on Systems, Man, Cybernetics, SMC-6, 1976, 325–327.
[17] Felici, G., Sun, F., Truemper, K.: A Method for Controlling Errors in Two-Class Classification, Proceedings of Computer Software and Applications Conference, 1999.
[18] Fukunaga, K.: Introduction to Statistical Pattern Recognition, New York Academic Press, 1990.
[19] Huang, A.: Similarity Measures for Text Document Clustering, Proceedings of the New Zealand Computer Science Research Student Conference, Christchurch, New Zealand, 2008.
[20] Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proceedings of the European Conference on Machine Learning, Berlin, Germany, 1998.
[21] Karypis, G., Han, E. H.: Fast Supervised Dimensionality Reduction Algorithm with Applications to Document Categorization and Retrieval, Proceedings of the ACM Conference on Information and Knowledge Management, 2000.
[22] Lehmann, E. L.: Testing of Statistical Hypotheses, New York: John Wiley, 1976.
[23] Manning, C. D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval, Cambridge University Press, New York, 2008.
[24] Morin, R. L., Raeside, D. E.: A Reappraisal of Distance Weighted k-Nearest Neighbor Classification for Pattern Recognition with Missing Data, IEEE Transactions on Systems, Man, Cybernetics, 11(3), 1981, 241–243.
[25] Porter, M. F.: An Algorithm for Suffix Stripping, Program, 14(3), 1980, 130–137.
[26] Pun, T.: A New Method for Grey-level Picture Thresholding using the Entropy of the Histogram, Signal Processing, 2, 1980, 223–237.
[27] Rao, C. R., Mitra, S. K., Matthai, A., Ramamurthy, K. G., Eds.: Formulae and Tables for Statistical Work, Statistical Publishing Society, Calcutta, 1966.
[28] Salton, G., McGill, M. J.: Introduction to Modern Information Retrieval, McGraw Hill, 1983.
[29] Salton, G., Wong, A., Yang, C.: A Vector Space Model for Automatic Indexing, Communications of ACM, 18(11), 1975, 613–620.
[30] Tan, F., Fu, X., Zhang, Y., Bourgeois, A. G.: A Genetic Algorithm-based Method for Feature Subset Selection, Soft Computing, 12(2), 2007, 111–120.
[31] TREC, Ed.: Text REtrieval Conference, http://trec.nist.gov.
[32] Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization, Information Retrieval, Kluwer Academic Publishers, 1(1-2), 1999, 69–90.
[33] Zhang, J., Chen, L., Guo, G.: Projected-Prototype based Classifier for Text Categorization, Knowledge-Based Systems, 49, 2013, 179–189.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-4b9e8949-6899-42b1-9aca-b191a5dc9b44