Article title

XCleaner: A new method for clustering XML documents by structure

Abstract
With data resources on the Internet growing rapidly, XML has become one of the most important standards for document management. It not only improves document exchange and storage but also supports a variety of information retrieval tasks. Document clustering is one of the most interesting research areas that exploit the semi-structured nature of XML. In this paper, we put forward a new XML clustering algorithm that relies solely on document structure. We propose the use of maximal frequent subtrees and an operator called Satisf/Violate to divide documents into groups. The algorithm is experimentally evaluated on real and synthetic data sets, with promising results.
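The abstract does not say how the maximal frequent subtrees are mined or how the Satisf/Violate operator is defined, so the following is only a rough illustration of structure-only grouping, not the paper's algorithm. It swaps in a much simpler technique: root-to-node tag paths stand in for subtrees, and a document is assumed to "satisfy" a pattern when it contains every path in that pattern. The function names (`tag_paths`, `frequent_paths`, `split_by_pattern`), the support threshold, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of one document (text is ignored)."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def frequent_paths(path_sets, min_support):
    """Paths occurring in at least min_support (a fraction) of the documents."""
    counts = Counter(p for paths in path_sets for p in paths)
    threshold = min_support * len(path_sets)
    return {p for p, c in counts.items() if c >= threshold}


def split_by_pattern(path_sets, pattern):
    """Assumed reading of a satisfy/violate split: indices of documents that
    contain every path in `pattern`, and indices of those that do not."""
    satisfy, violate = [], []
    for i, paths in enumerate(path_sets):
        (satisfy if pattern <= paths else violate).append(i)
    return satisfy, violate


# Hypothetical sample collection: two "book" and two "article" documents.
docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><author/></book>",
    "<article><title/><journal/></article>",
    "<article><title/></article>",
]
path_sets = [tag_paths(d) for d in docs]
frequent = frequent_paths(path_sets, min_support=0.5)
satisfy, violate = split_by_pattern(path_sets, {"book/title", "book/author"})
print(sorted(frequent))
print(satisfy, violate)
```

In this toy collection the pattern `{"book/title", "book/author"}` separates the two book documents from the two article documents. Applying such splits repeatedly with different patterns would yield a hierarchy of groups, but whether XCleaner proceeds this way cannot be determined from the abstract.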
Physical description
Bibliography: 20 items, illustrations.
  • Institute of Computing Science, Poznań University of Technology, Piotrowo 2, 60-965 Poznań, Poland
  • Aggarwal, C.C., Ta, N., Wang, J., Feng, J. and Zaki, M.J. (2007) Xproj: a framework for projected structural clustering of xml documents. In: P. Berkhin, R. Caruana and X. Wu, eds., KDD. ACM, 46-55.
  • Barbosa, D., Mendelzon, A.O., Keenleyside, J. and Lyons, K.A. (2002) Toxgene: a template-based data generator for xml. In: M.J. Franklin, B. Moon and A. Ailamaki, eds., SIGMOD Conference. ACM, 616.
  • Candillier, L., Tellier, I. and Torre, F. (2005) Transforming xml trees for efficient classification and clustering. In: N. Fuhr, M. Lalmas, S. Malik and G. Kazai, eds., INEX. LNCS 3977, Springer, 469-480.
  • Chawathe, S.S. (1999) Describing and manipulating xml data. IEEE Data Eng. Bull., 22 (3), 3-9.
  • Chi, Y., Xia, Y., Yang, Y. and Muntz, R.R. (2005) Mining closed and maximal frequent subtrees from databases of labeled rooted trees. IEEE Trans. Knowl. Data Eng., 17 (2), 190-202.
  • Costa, G., Manco, G., Ortale, R. and Tagarelli, A. (2004) A tree-based approach to clustering xml documents by structure. In: J.-F. Boulicaut, F. Esposito, F. Giannotti and D. Pedreschi, eds., PKDD. LNCS 3202, Springer, 137-148.
  • Dalamagas, T., Cheng, T., Winkel, K.-J. and Sellis, T.K. (2004) Clustering xml documents by structure. In: G.A. Vouros and T. Panayiotopoulos, eds., SETN. LNCS 3025, Springer, 112-121.
  • Doucet, A. and Ahonen-Myka, H. (2002) Naïve clustering of a large xml document collection. In: N. Fuhr, N. Gövert, G. Kazai and M. Lalmas, eds., INEX Workshop. ERCIM, European Research Consortium for Informatics and Mathematics, 81-87.
  • Florescu, D. and Kossmann, D. (1999) Storing and querying xml data using an rdmbs. IEEE Data Eng. Bull., 22 (3), 27-34.
  • Jain, A.K., Murty, M.N. and Flynn, P.J. (1999) Data clustering: A review. ACM Comput. Surv., 31 (3), 264-323.
  • Johnson, S. (1967) Hierarchical clustering schemes. Psychometrika, 32 (3), 241-254.
  • Lee, M.-L., Yang, L.H., Hsu, W. and Yang, X. (2002) Xclust: clustering xml schemas for effective integration. In: CIKM. ACM, 292-299.
  • Lesniewska, A. (2009) Clustering xml documents by structure. In: J. Grundspenkis, M. Kirikova, Y. Manolopoulos and L. Novickis, eds., ADBIS (Workshops). LNCS 5968, Springer, 238-246.
  • Lian, W., Cheung, D.W.-L., Mamoulis, N. and Yiu, S.-M. (2004) An efficient and scalable algorithm for clustering xml documents by structure. IEEE Trans. Knowl. Data Eng., 16 (1), 82-96.
  • Milligan, G. and Cooper, M. (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179. DOI: 10.1007/BF02294245.
  • Nayak, R. and Iryadi, W. (2006) Xmine: A methodology for mining xml structure. In: X. Zhou, J. Li, H.T. Shen, M. Kitsuregawa and Y. Zhang, eds., APWeb. LNCS 3841, Springer, 786-792.
  • Tan, P.-N., Steinbach, M. and Kumar, V. (2005) Introduction to Data Mining. Addison Wesley.
  • Tran, T., Nayak, R. and Bruza, P. (2008) Combining structure and content similarities for xml document clustering. In: J.F. Roddick, J. Li, P. Christen and P.J. Kennedy, eds., AusDM. CRPIT 87, Australian Computer Society, 219-226.
  • Widom, J. (1999) Data management for xml: Research directions. IEEE Data Eng. Bull., 22 (3), 44-52.
  • Zaki, M.J. (2002) Efficiently mining frequent trees in a forest. In: KDD. ACM, 71-80.