Geodesic distances for clustering linked text data

Tekir, S.; Mansmann, F.; Keimer, D.

Powiadomienia systemowe

Sesja wygasła!
Sesja wygasła!

Artykuł - szczegóły

Tytuł artykułu

Geodesic distances for clustering linked text data

Autorzy

Tekir S. , Mansmann F. , Keimer D.

Treść / Zawartość

Pełne teksty:

Pobierz

Identyfikatory

Warianty tytułu

Języki publikacji

Abstrakty

The quality of a clustering not only depends on the chosen algorithm and its parameters, but also on the definition of the similarity of two respective objects in a dataset. Applications such as clustering of web documents is traditionally built either on textual similarity measures or on link information. Due to the incompatibility of these two information spaces, combining these two information sources in one distance measure is a challenging issue. In this paper, we thus propose a geodesic distance function that combines traditional similarity measures with link information. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on a subset of the EnglishWikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. Additionally, taking inspiration from the unified view of the performance functions of k-means and k-harmonic means, min and harmonic average of the cosine and geodesic distances are taken in order to construct alternate distance forms. The effectiveness of our approach is measured by computing microprecision values of the clusters based on the provided categorical information of each article.

Słowa kluczowe

clustering geodesic distance text data k-means algorithm cosine distance k-harmonic means microprecision values

Wydawca

University of Social Sciences

Czasopismo

Journal of Artificial Intelligence and Soft Computing Research

Rocznik

2012

Tom

Vol. 2, No. 3

Strony

247--258

Opis fizyczny

Bibliogr. 25 poz., rys.

Twórcy

autor

Tekir S.

selmatekir@iyte.edu.tr

Dept. of Computer Engineering Izmir Institute of Technology Izmir, Turkey 35430

autor

Mansmann F.

Florian.Mansmann@uni-konstanz.de

Faculty of Computer Science, Box 78 University of Konstanz Konstanz, Germany 78457

autor

Keimer D.

Daniel.Keim@uni-konstanz.de

Faculty of Computer Science, Box 78 University of Konstanz Konstanz, Germany 78457

Bibliografia

[1] B. A. Dubrovin, A. T. Fomenko, and S. P. Novikov, Modern Geometry - Methods and Applications: Part I: The Geometry of Surfaces, Transformation Groups, and Fields (Graduate Texts in Mathematics).Springer, 1991.
[2] G. Salton, Automatic text processing : the transformation, analysis, and retrieval of information by computer. Addison-Wesley., 1989.
[3] M. Lou, “Traffic pattern in negatively curved network,” Ph.D. dissertation, University of Southern California, 2009.
[4] L. Denoyer and P. Gallinari, “The wikipedia xml corpus,” SIGIR Forum, vol. 40, no. 1, pp. 64–69, 2006.
[5] J. Kamps and M. Koolen, “Is wikipedia link structure different?” pp. 232–241, 2009.
[6] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. L. Cam and J. Neyman, Eds., vol. 1. University of California Press, 1967, pp. 281–297.
[7] M. Hui-Fang, H. Qing, and S. Zhong-Zhi, “Geodesic distance based aproach for sentence similarity computation,” in Machine Learning and Cybernetics, 2008 International Conference on, vol. 5, 2008, pp. 2551–2557.
[8] J. Goth and A. Skrop, “Varying retrieval categoricity using hyperbolic geometry,” Inf. Retr., vol. 8, no. 2, pp. 265–283, 2005.
[9] B. Xiao and E. Hancock, “Geometric characterization of graphs,” in Image analysis and processing: ICIAP 2005, 13th international conference,Cagliari, Italy, September 6-8, 2005. Springer, 2005, pp. 471–478.
[10] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web,” Stanford InfoLab, Technical Report, 1999, previous number = SIDL-WP-1999-0120.
[11] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, no. 5, pp. 604–632, 1999.
[12] A. Strehl, E. Strehl, J. Ghosh, and R. Mooney, “Impact of similarity measures on web-page clustering,” in In Workshop on Artificial Intelligence for Web Search (AAAI 2000, 2000, pp. 58–64.
[13] N. Oikonomakou and M. Vazirgiannis, “A review of web document clustering approaches,” in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Springer US, 2005, pp. 921–943.
[14] P. Calado, M. Cristo, E. Moura, N. Ziviani, B. Ribeiro-Neto, and M. A. Gonalves, “Combining link-based and content-based methods for web document classification,” pp. 394–401, 2003.
[15] K. Yang, “Combining text- and link-based methods for web ir.” in Proceedings of the 10th Text Rerieval Conference (TREC-10). Washington, DC.: US Government Printing Office., 2001.
[16] S. Tekir, F. Mansmann, and D. Keim, “Geodesic distances for web document clustering,” in Computational Intelligence and Data Mining (CIDM), 2011 IEEE Symposium on, april 2011, pp. 15 –21.
[17] D. J. Watts and S. H. Strogatz, “Collective dynamics of ’small-world’ networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998, 10.1038/30918.
[18] J. White and D. Kalman, “Cardano: An adventure in algebra in 8 parts.”
[19] D. Modha and W. Spangler, “Feature weighting in k-means clustering,” Machine Learning, vol. 52, pp. 217–237(21), September 2003.
[20] E. Begelfor and M. Werman, “The world is not always flat or learning curved manifolds.” School of Engineering and Computer Science, Hebrew University of Jerusalem., Tech. Rep., 2005.
[21] W. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[22] T. Cover and J. Thomas, Elements of Information Theory 2nd Edition. Wiley-Interscience, 2006.
[23] J. Wu, H. Xiong, and J. Chen, “Adapting the right measures for k-means clustering,” pp. 877–886, 2009.
[24] B. Zhang, M. Hsu, and U. Dayal, “K-harmonic means - a data clustering algorithm,” 1999.
[25] D. Turnbull, “K-means and k-harmonic means: A comparison of two unsupervised clustering algorithms.” University of California San Diego Department of Computer Science and Engineering, Tech. Rep., 2002.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-db35e218-57cf-4436-a441-cf7876e8a168