Preferencje help
Widoczny [Schowaj] Abstrakt
Liczba wyników
2007 | Vol. 36, no 4 | 1009-1035
Tytuł artykułu

Extending k-means with the description comes first approach

Treść / Zawartość
Warianty tytułu
Języki publikacji
This paper describes a technique for clustering large collections of short and medium length text documents such as press articles, news stories and the like. The technique called description comes first (DCF) consists of identification of related document clusters, selection of salient phrases relevant to these clusters and reallocation of documents matching the selected phrases to form final document groups. The advantages of this technique include more comprehensive cluster labels and clearer (more transparent) relationship between cluster labels and their content. We demonstrate the DCF by taking a standard k-means algorithm as a baseline and weaving DCF elements into it; the outcome is the descriptive k-means (DKM) algorithm. The paper goes through technical background explaining how to implement DKM efficiently and ends with the description of an experiment measuring clustering quality on a benchmark document collection 20-newsgroups. Short fragments of this paper appeared at the poster session of the RIAO 2007 conference, Pittsburgh, PA, USA (electronic proceedings only).

Opis fizyczny
Bibliogr. 30 poz., rys., wykr.
  • ABNEY, S. (1991) Parsing by Chunks. In: R.C. Berwick, S.P. Abney and C. Tenny, eds., Principle-Based Parsing: Computation and Psycholinguistics. Kluwer Academic Publishers, 257-278.
  • ARTHUR, D. and VASSILVITSKII, S. (2006) On the worst case complexity of the k-means method. 22nd Annual ACM Symposium on Computational Geometry.
  • BAEZA-YATES, R.A. and RIBEIRO-NETO, B.A. (1999) Modern Information Retrieval. ACM Press, Addison-Wesley.
  • BEKKERMAN, R., RAGHAVAN, H., ALLAN. J. and EGUCHI, K. (2007) Interactive clustering of text collections according to a user-specified criterion. In: Proceedings of IJCAI-07, the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India.
  • CARPINETO, C. and ROMANO, G. (2004) Exploiting the Potential of Concept Lattices for Information Retrieval with CREDO. Journal of Universal Computer Science 10 (8), 985-1013.
  • CHENG, D., VEMPALA,S., KANNAN,R. and WANG,G. (2005) A Divide-And-Merge Methodology for Clustering. In: Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Baltimore, Maryland, USA, ACM Press, 196-205.
  • DEAN. J. and GHEMAWAT, S. (2004) Mapreduce: Simplified data processing on large clusters. In: OSDI’04: Sixth Symposium, on Operating System Design and Implementation.
  • DHILLON, I.S. and MODHA, D.S. (2001) Concept decompositions for large sparse text data using clustering. Machine Learning 42 1-2, 143 175.
  • DHILLON, I.S., FAN, J. and GUAN, Y. (2001) Efficient Clustering of Very Large Document Collections. In: V.K.R. Grossman, C. Kamath and R. Namburu, eds., Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers.
  • FERRAGINA, P. and GULLI, A. (2004) The Anatomy of SnakeT: A Hierarchical Clustering Engine for Web-Page Snippets. In: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy. Lecture Notes in Computer Science 3202, Springer, 506-508.
  • GOSWAMI, A., JIN, R. and AGRAWAL, G. (2004) Fast and Exact Out-of-Core K-Means Clustering. In: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 1-4 November 2004, Brighton, UK. IEEE Computer Society, 83- 90.
  • HOTHO, A., STAAB, S. and STUMME, G. (2003) Explaining Text Clustering Results Using Semantic Structures. In: Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia. Lecture Notes in Computer Science 2838, Springer, 217-228.
  • HOTHO, A. and STUMME, G. (2002) Conceptual Clustering of Text Clusters. In: Proceedings of FGML Workshop, Special Interest Croup of German Informatics Society, Hannover, Germany, 37-45.
  • KORONACKI, J. and MIELNICZUK, J. (2001) Statystyka dla studentów kierunków technicznych i przyrodniczych. Wydawnictwa Naukowo-Technicznc WNT.
  • KUMMAMURU, K., LOTLIKAR, R., ROY S., SINGAL, K. and KRISHNAPURAM, R. (2004) A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results. In: Proceedings of the 13th International Conference on World Wide Web, New York, NY, USA. ACM Press, 658-665.
  • LARSSON, J.N. (1999) Structures of String Matching and Data Compression. PhD thesis, Department of Computer Science, Lund University.
  • MANNING, C.D., RAGHAVAN, P. and SCHÜTZE, H. (2008) Introduction to Information Retrieval. Cambridge University Press, (to appear).
  • MANNING, C.D. and SCHÜTZE, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.
  • OSIŃSKI, S., STEFANOWSKI, J. and WEISS, D. (2004) Lingo: Search results clustering algorithm based on Singular Value Decomposition. In: M.A. Kłopotek, S.T. Wierzchoń, and K. Trojanowski, eds., Proceedings of the International IIS: Intelligent Information Processing and Web Mining Conference (Zakopane. Poland). Advances in Soft Computing, Springer, 359-368.
  • OSIŃSKI, S. and WEISS, D. (2005) A concept-driven algorithm for clustering search results. IEEE Intelligent Systems 20 (3), 48-54.
  • PANTEL, P. and LIN, D. (2002) Document Clustering With Committees. In: Proceedings of the 25th ACM International Conference on Research and Development in Information Retrieval, Tampere, Finland. ACM Press, 199-206.
  • PELLEG, D. and MOORE, A. (1999) Accelerating Exact k-means Algorithms with Geometric Reasoning. In: Knowledge Discovery and Data Mining, 277-281.
  • SALTON, G. (1989) Automatic Text Processing - The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley.
  • SANDERSON, M. and CROFT, B. (1999) Deriving Concept Hierarchies from Text. In: Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval, Berkeley, USA. 206-213.
  • SCHÜTZE, H. and SILVERSTEIN, C. (1997) Projections for Efficient Document Clustering. In: Proceedings of the 20th ACM International Conference on Research and Development in Information Retrieval, Philadelphia, PA, USA. ACM Press, 74-81.
  • UKKONEN, E. (1995) On-Line Construction of Suffix Frees. Algorithmica 14 3, 249-260.
  • WEISS, D. (2006) Descriptive Clustering as a Method for Exploring Text Collections. PhD thesis, Poznań University of Technology, Poznań, Poland.
  • ZAMIR, O. and ETZIONI, O. (1999) Grouper: A Dynamic Clustering Interface to Web Search Results. Computer Networks 31, 11-16, 1361-1374.
  • ZENG, H.-J., HE, Q.-C., CHEN, Z., MA, W.-Y. and MA, J. (2004) Learning to Cluster Web Search Results. In: Proceedings of the 27th ACM International Conference on Research and Development in Information Retrieval, Sheffield, United Kingdom, ACM Press, 210-217.
  • ZHANG, T., DAMERAU, F. and JOHNSON, D. (2002) Text Chunking Based on a Generalization of Winnow. Journal of Machine Learning Research 2, 615 637.
Typ dokumentu
Identyfikator YADDA
JavaScript jest wyłączony w Twojej przeglądarce internetowej. Włącz go, a następnie odśwież stronę, aby móc w pełni z niej korzystać.