Document Clustering : Concepts, Metrics and Algorithms

Tarczynski, T.

Abstract

Document clustering, also referred to as text clustering, is a technique for unsupervised document organisation. Text clustering groups documents into subsets, called clusters, whose members are similar to each other. Document clustering algorithms are widely used in web search engines to produce results relevant to a query. An example of the practical use of these techniques is Yahoo!'s hierarchies of documents [1]. Another application of document clustering is browsing, which is defined as a search session without a well-specified goal. Browsing techniques rely heavily on document clustering. In this article we examine the most important concepts related to document clustering. Besides the algorithms, we present a comprehensive discussion of the representation of documents, the calculation of similarity between documents, and the evaluation of cluster quality.

Keywords

document clustering, text mining, k-means, hierarchical clustering, vector space model

Publisher

Polish Academy of Sciences, Committee of Electronics and Telecommunication

Journal

International Journal of Electronics and Telecommunications

Year

2011

Volume

Vol. 57, No. 3

Pages

271-277

Physical description

Bibliography: 38 items, figures.

Contributors

author

Tarczynski T.

Department of Applied Informatics, Warsaw University of Life Sciences, ul. Nowoursynowska 159, 02-767 Warsaw, Poland, tomek.tarczynski@gmail.com

Bibliography

[1] Y. Labrou and T. Finin, „Yahoo! as an ontology: using Yahoo! categories to describe documents”, in Proceedings of the eighth international conference on Information and knowledge management, ser. CIKM '99. New York, NY, USA: ACM, 1999, pp. 180 - 187. [Online]. Available: http://doi.acm.org/10.1145/319950.319976
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, „Data clustering: a review”, ACM Comput. Surv., vol. 31, pp. 264 - 323, September 1999. [Online]. Available: http://doi.acm.org/10.1145/331499.331504
[3] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, „Scatter/gather: a cluster-based approach to browsing large document collections”, in Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '92, New York, NY, USA, 1992, pp. 318 - 329. [Online]. Available: http://doi.acm.org/10.1145/133160.133214
[4] G. Salton, A. Wong, and C. S. Yang, „A vector space model for automatic indexing”, Commun. ACM, vol. 18, pp. 613 - 620, November 1975. [Online]. Available: http://doi.acm.org/10.1145/361219.361220
[5] G. Salton and C. Buckley, „Term weighting approaches in automatic text retrieval”, Cornell University, Ithaca, NY, USA, Tech. Rep., 1987.
[6] S. K. M. Wong, W. Ziarko, V. V. Raghavan, and P. C. N. Wong, „On modeling of information retrieval concepts in vector spaces”, ACM Trans. Database Syst., vol. 12, pp. 299 - 321, June 1987. [Online]. Available: http://doi.acm.org/10.1145/22952.22957
[7] X. Tai, M. Sasaki, Y. Tanaka, and K. Kita, „Improvement of vector space information retrieval model based on supervised learning”, in Proceedings of the fifth international workshop on Information retrieval with Asian languages, ser. IRAL '00, New York, NY, USA, 2000, pp. 69 - 74. [Online]. Available: http://doi.acm.org/10.1145/355214.355224
[8] G. Salton, Ed., Automatic text processing. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1988.
[9] Y. Zhao and G. Karypis, „Empirical and theoretical comparisons of selected criterion functions for document clustering”, Mach. Learn., vol. 55, pp. 311 - 331, June 2004. [Online]. Available: http://portal.acm.org/citation.cfm?id=990375.990398
[10] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, and J. Ma, „Learning to cluster web search results”, in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '04, New York, NY, USA, 2004, pp. 210 - 217. [Online]. Available: http://doi.acm.org/10.1145/1008992.1009030
[11] C. F. Olson, „Parallel algorithms for hierarchical clustering”, Parallel Comput., vol. 21, 1995.
[12] C. J. van Rijsbergen, Information Retrieval, 2nd ed. Newton, MA, USA: Butterworth-Heinemann, 1979.
[13] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel, „Performance measures for information extraction”, in Proceedings of DARPA Broadcast News Workshop, 1999, pp. 249 - 252.
[14] A. El-Hamdouchi and P. Willett, „Comparison of hierarchic agglomerative clustering methods for document retrieval”, The Computer Journal, vol. 32, pp. 220 - 227, Jan 1989. [Online]. Available: http://dx.doi.org/10.1093/comjnl/32.3.220
[15] M. Steinbach, G. Karypis, and V. Kumar, „A comparison of document clustering techniques”, 2000. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.1505
[16] W. H. Day and H. Edelsbrunner, „Efficient algorithms for agglomerative hierarchical clustering methods”, Journal of Classification, vol. 1, pp. 7 - 24, 1984. [Online]. Available: http://dx.doi.org/10.1007/BF01890115
[17] G. A. Wilkin and X. Huang, „A practical comparison of two k-means clustering algorithms”, BMC Bioinformatics, vol. 9, p. S19, 2008. [Online]. Available: http://www.biomedcentral.com/1471-2105/9/S6/S19
[18] J. Wu, H. Xiong, and J. Chen, „Adapting the right measures for k-means clustering”, in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD '09, New York, NY, USA, 2009, pp. 877 - 886. [Online]. Available: http://doi.acm.org/10.1145/1557019.1557115
[19] M. Chiang and B. Mirkin, „Experiments for the number of clusters in k-means”, in Progress in Artificial Intelligence, ser. Lecture Notes in Computer Science, J. Neves, M. Santos, and J. Machado, Eds. Springer Berlin / Heidelberg, 2007, vol. 4874, pp. 395 - 405. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-77002-2_33
[20] D. Arthur and S. Vassilvitskii, „k-means++: the advantages of careful seeding”, in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, ser. SODA '07, Philadelphia, PA, USA, 2007, pp. 1027 - 1035. [Online]. Available: http://portal.acm.org/citation.cfm?id=1283383.1283494
[21] R. Maitra, A. Peterson, and A. Ghosh, „A systematic evaluation of different methods for initializing the k-means clustering algorithm”, IEEE Transactions on Knowledge and Data Engineering, 2010.
[22] G. W. Milligan and P. D. Isaac, „The validation of four ultrametric clustering algorithms”, Pattern Recognition, vol. 12, no. 2, pp. 41 - 50, 1980. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0031320380900011
[23] P. S. Bradley and U. M. Fayyad, „Refining initial points for k-means clustering”, in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML '98, San Francisco, CA, USA, 1998, pp. 91 - 99. [Online]. Available: http://portal.acm.org/citation.cfm?id=645527.657466
[24] B. Mirkin, Clustering for Data Mining: A Data Recovery Approach. Chapman and Hall/CRC, 2005.
[25] D. H. Fisher, „Knowledge acquisition via incremental conceptual clustering”, Mach. Learn., vol. 2, pp. 139 - 172, September 1987. [Online]. Available: http://portal.acm.org/citation.cfm?id=639960.639990
[26] P. Cheeseman and J. Stutz, „Advances in knowledge discovery and data mining”, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park, CA, USA: American Association for Artificial Intelligence, 1996, ch. Bayesian classification (AutoClass): theory and results, pp. 153 - 180. [Online]. Available: http://portal.acm.org/citation.cfm?id=257938.257954
[27] S. Savaresi, D. L. Boley, S. Bittanti, and G. Gazzaniga, „Choosing the cluster to split in bisecting divisive clustering algorithms”, in Second SIAM International Conference on Data Mining, 2000.
[28] M. Meila and D. Heckerman, „An experimental comparison of model-based clustering methods”, Mach. Learn., vol. 42, pp. 9 - 29, January 2001. [Online]. Available: http://portal.acm.org/citation.cfm?id=599609.599627
[29] G. Karypis, E. Han, and V. Kumar, „Chameleon: Hierarchical clustering using dynamic modeling”, Computer, vol. 32, pp. 68 - 75, August 1999. [Online]. Available: http://dx.doi.org/10.1109/2.781637
[30] D. Boley, „Principal direction divisive partitioning”, Data Min. Knowl. Discov., vol. 2, pp. 325 - 344, December 1998. [Online]. Available: http://portal.acm.org/citation.cfm?id=593421.593471
[31] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, „Bipartite graph partitioning and data clustering”, in Proceedings of the tenth international conference on Information and knowledge management, ser. CIKM '01, New York, NY, USA, 2001, pp. 25 - 32. [Online]. Available: http://doi.acm.org/10.1145/502585.502591
[32] C. H. Zha, H. Zha, X. He, C. Ding, H. Simon, and M. Gu, „Spectral relaxation for k-means clustering”, in Advances in Neural Information Processing Systems. MIT Press, 2001, pp. 1057 - 1064.
[33] I. S. Dhillon and D. S. Modha, „Concept decompositions for large sparse text data using clustering”, Mach. Learn., vol. 42, pp. 143 - 175, January 2001. [Online]. Available: http://portal.acm.org/citation.cfm?id=370660.370699
[34] O. Zamir, O. Etzioni, O. Madani, and R. M. Karp, „Fast and intuitive clustering of web documents”, in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 1997, pp. 287 - 290.
[35] M. Dash, S. Petrutiu, and P. Scheuermann, „Efficient parallel hierarchical clustering”, in International Europar Conference, 2004.
[36] Y. Song, W. Chen, H. Bai, C. Lin, and E. Chang, „Parallel spectral clustering”, Machine Learning and Knowledge Discovery in Databases, pp. 374 - 389, 2008. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-87481-2_25
[37] Y. Liu, J. Mostafa, and W. Ke, „A fast online clustering algorithm for scatter/gather browsing”, 2007.
[38] D. R. Cutting, D. R. Karger, and J. O. Pedersen, „Constant interaction-time scatter/gather browsing of very large document collections”, in Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '93, New York, NY, USA, 1993, pp. 126 - 134. [Online]. Available: http://doi.acm.org/10.1145/160688.160706

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BWAK-0026-0005