Mining Induced/Embedded Subtrees using the Level of Embedding Constraint

Tan, H.; Hadzic, F.; Dillon, T. S.

Powiadomienia systemowe

Sesja wygasła!

Artykuł - szczegóły

Tytuł artykułu

Mining Induced/Embedded Subtrees using the Level of Embedding Constraint

Autorzy

Tan H. , Hadzic F. , Dillon T. S.

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Identyfikatory

Warianty tytułu

Języki publikacji

Abstrakty

The increasing need for representing information through more complex structures where semantics and relationships among data objects can be more easily expressed has resulted in many semi-structured data sources. Structure comparison among semi-structured data objects can often reveal valuable information, and hence tree mining has gained a considerable amount of interest in areas such as XML mining, Bioinformatics, Web mining etc. We are primarily concerned with the task of mining frequent ordered induced and embedded subtrees from a database of rooted ordered labeled trees. Our previous contributions consist of the efficient Tree Model Guided (TMG) candidate enumeration approach for which we developed a mathematical model that provides an estimate of the worst case complexity for embedded subtree mining. This potentially reveals computationally impractical situations where one would be forced to constrain the mining process in some way so that at least some patterns can be discovered. This motivated our strategy of tackling the complexity of mining embedded subtrees by introducing the Level of Embedding constraint. Thus, when it is too costly to mine all frequent embedded subtrees, one can decrease the level of embedding constraint gradually down to 1, from which all the obtained frequent subtrees are induced subtrees. In this paper we develop alternative implementations and propose two algorithms MB3-R and iMB3-R, which achieve better efficiency in terms of time and space. Furthermore, we develop a mathematical model for estimating the worst case complexity for induced subtree mining. It is accompanied with a theoretical analysis of induced-embedded subtree relationships in terms of complexity for frequent subtree mining. Using synthetic and real world data we practically demonstrate the space and time efficiency of our new approach and provide some comparisons to the two well know algorithms for mining induced and embedded subtrees.

Słowa kluczowe

algorithms performance experimentation tree mining TreeMiner FREQT TMG Tree Model Guided

Wydawca

IOS Press

Czasopismo

Fundamenta Informaticae

Rocznik

2012

Tom

Vol. 119, nr 2

Strony

187--231

Opis fizyczny

Bibliogr. 46 poz., tab., wykr.

Twórcy

autor

Tan H.

autor

Hadzic F.

autor

Dillon T. S.

Digital Ecosystems and Business Intelligence Institute (DEBII), Curtin University of Technology, Perth, Australia, fedja.hadzic@curtin.edu.au

Bibliografia

[1] Agrawal, R., Srikant, R.: Fast algorithm for mining association rules. In Proceedings of the 20th Conference on Very Large Data Bases (VLDB 1994), Satiago de Chile, Chile, 1994, 487-499.
[2] Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Coferece on Management of Data (SIGMOD 1993), Washigton, DC, USA, 1993, 207-216.
[3] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In Advances in Knowledge Discovery ad Data Mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Ramasamy Uthurusamy, Eds. American Association for Artificial Itelligence, CA, USA, 1996, 307-328.
[4] Agrawal, R., Srikant, R.: Mining sequential patterns. In Proceedings of the 11th IEEE Proceedings of the 11th Interatioal Conferece on Data Engieering (ICDE 1995), Taipei, Taiwa, 1995, 3-14.
[5] Asai, T., Abe, K., Kawasoe, S., Arimura, H., Sakamoto, H., Arikawa, S.: Efficient substructure discovery from large semi-structured data. In Proceedings of the 2nd SIAM Interatioal Conferece on Data Mining, 2002, 158U˝ 174.
[6] Bayardo, R. J.: Efficiently mining log patterns from databases. In Proceedings of ACM SIGMOD Conference on Management of Data (SIGMOD 1998), Seattle, WA, USA, 1998, 85-93.
[7] Bayardo, R.J., Agrawal, R., Gunopulos, D.: Costraint-based rule mining on large, dense data sets', In Proceedings of the 1999 Interatioal Cof. on Data Egieerig (ICDE'99), Sydney, Australia, 1999.
[8] Bodon, F.: A fast apriori implementation. Informatics Laboratory, Computer ad Automation Research Institute, Hungaria Academy of Scieces, 2003.
[9] Brin, S., Motwani, R., Ullman, J. D., Tsur, S.: Dynamic itemset counting and implication rules for market basket data. In Proceedings of ACM SIGMOD Conference o Management of Data (SIGMOD 1997), Arizona, USA, 1997, 255-264.
[10] Chi, Y., Yang, Y., Muntz, R.R.: HybridTreeMiner: A efficient algorithm for mining frequent rooted trees and free trees using canonical forms. In Proceedings of the 16th Interational Conference on Scietific and Statistical Database Management, Satorii Island, Greece, 2004.
[11] Chi, Y., Nijssen, S., Muntz, R.R., Kok. J..: Frequent subtree mining an overview. Fundamenta Informaticae, Special Issue on Graph ad Tree Mining, vol. 65, no. 1-2, 2005, 161-198.
[12] Cook, D., Holder, L.: Substructure discovery using minimal description length and background knowledge, Journal of Artificial Intelligence Research, 1, 1994, 231-255.
[13] Dehaspe, L., Toivonen, H., King, R.: Finding frequent substructures in chemical compounds, In 4th Intl. Conf. Knowledge Discovery ad Data Mining, August 1998.
[14] Feng, L., Dillon, T.S., Weigand, H., Chang, E.: An XML-Enabled association rule framework. In Proceedings of the 14th Database and Expert Systems Applications (DEXA 2003), Prague, Czech Republic, 2003, 88-97.
[15] Feng, L., Dillon, T.S.: Mining XML-Enabled association rules with templates. In Proceedings of the 3rd Interational Workshop on Kowledge Discovery in Inductive Databases (KDID 2004), Pisa, Italy, 2004, 66-88.
[16] Feng, L., Dillon, T.S.: An XML-Enabled data mining query laguage XML-DMQL (invited paper). Interatioal Journal of Business Itelligece ad Data Mining, vol. 1, no. 1, 2005, 22-41.
[17] Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In Proceedings of ACM SIGMOD Conferece on Management of Data, Dallas, Texas, USA, 2000, 1-12.
[18] Hido, S. Ad Kawano, H.: AMIOT: Induced Ordered Tree Mining in Tree-structured Databases, In Proceedings of the 5th IEEE Interational Conference on Data Mining (ICDM'05), 170-177, Houston, Texas, USA, 2005.
[19] Inokuchi, A.,Washio, T. Motoda, H.: An Apriori-based algorithm for mining frequent substructures from graph data. In D.A. Zighed, H.J.Komorowski, ad J.M. Zytkow, editors, Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 2000, 13-23.
[20] Inokuchi, A., Washio, T., Nishimura, Y., Motoda, H.: General framework for mining frequent patterns in structures. In Proceedings of the ICDM-2002 workshop on Active Mining (AM-2002), 2002, 23-30.
[21] Jenkins, B.: Hash Fuctios. Dr. Dobb's Joural, September 1997.
[22] Kudo, T.: An implementation of FREQT, http://www.chase.org/˜taku/software/freqt/, 2003. (Last accessed 1 Jan 2006).
[23] Kuramochi, M. Karypis, G.: An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engieering, vol. 16, no. 9, 2004, 1038-1051.
[24] Nijssen, S. and Kok, J..: Efficient discovery of frequent unordered trees. In Proceedings of the 1st InterationalWorkshop on Mining Graphs, Trees, ad Sequences (MGTS-2003), Dubrovnik, Croatia, 2003.
[25] Park, J. S., Chen, M.-S., Yu, P.S.: Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Trasactions on Knowledge ad Data Engieering, vol. 9, no. 5, 1997, 813-825.
[26] Ruckert, U. Kramer, S.: Frequent free tree discovery in graph data. In Proceedings of the 2004 ACM symposium on Applied computing, Nicosia, Cyprus, 2004, 564 - 570.
[27] Sidhu, A. S., Dillon T. S., Chang, E., Sidhu, B.S.: Protein ontology: vocabulary for protein data. In Proceedings of the 3rd IEEE Interational Conference on Information Technology and Applications (ICITA 2005), Sydney, Australia, 2005, 465-469.
[28] Suciu, D.: Semistructured data and XML. In Information Orgaization and Databases: Foundatios of Data Orgaization, K. Tanaka, S. Ghandeharizadeh, ad Y. Kambayashi, Eds. Kluwer Interatioal Series in Egieering and Computer Science Series, vol. 579. Kluwer Academic Publishers, Norwell, MA, 2000, 9-30.
[29] Tan, H., Dillon, T.S., Feng, L., Chang, E., Hadzic, F.: X3-Miner: mining patterns from XML database. In Proceedings of the 6th Interational Data Mining 2005, Skiathos, Greece, 2005, 287-297.
[30] Tan, H., Dillon, T.S., Hadzic, F., Feng, L., Chang, E.: MB3-Miner: mining eMBedded subTREEs using Tree Model Guided candidate generation. In Proceedings of the 1st Interational Workshop on Mining Complex Data 2005 in conjuctio with ICDM 2005, Houston, Texas, USA, 2005, 103-110.
[31] Tan, H., Dillon, T.S., Hadzic, F., Feng, L., Chang, E.: IMB3-Miner: Mining Induced/Embedded subtrees by constraining the level of embedding. In Proceedings of Pacific-Asia Conferece on Knowledge Discovery and Data Mining (PAKDD 2006), Singapore, 2006.
[32] Tan, H., Hadzic, F., Dillon, T.S., Feng, L., Chang, E.: Tree Model Guided Cadidate Generation for Mining Frequent Subtrees from XML. ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 2, no. 2, July 2008.
[33] Tan, H., Dillon, T.S., Hadzic, F., Chang, E. Feng, L.: SEQUEST: mining frequent subsequeces using DMA Strips. In Proc. of Data Mining '06, 11-13 July, Prague, Czech Republic, 2006.
[34] Tatikonda, S., Parthasarathy, S., Kurc, T.: TRIPS ad TIDES: New algorithms for tree mining, In Proceedings of the 15th ACM Interational Conferece on Informatio and Knowledge Management (CIKM '06), Arlington, Virginia, USA, 2006.
[35] Termier, A., Rousset, M-C., Sebag, M.: Treefinder: A first step towards XML data mining. In Proceedings of the 2nd IEEE Interational Conference on Data Mining (ICDM 2002), Maebashi City, Japan, 2002, 450-458.
[36] Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W., Shi, B.: Efficient Pattern-Growth methods for frequent tree pattern mining. In Proceedings of Pacific-Asia Conferece on Knowledge Discovery and Data Mining (PAKDD 2004), Sydney, Australia, 2004, 441-451.
[37] Wang, K., Liu, H.: Discovering typical structures of documents: a road map approach. In Proceedings of the 21st Anual Interational ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998, 146-154.
[38] Yan, X. and Han, J.: gSpan: Graph-based substructure pattern mining. In Proceedings of the 2nd IEEE Interational Conference on Data Mining (ICDM 2002), Maebashi City, Japan, 2002, 721-724.
[39] Yan, X. and Han, J.: Closegraph: Mining closed frequent graph patterns. In Proceedings of ACM SIGKDD Interational Conference on Knowledge Discovery and Data Mining, 2003.
[40] Yang, L.H., Lee, M.L., Hsu, W.: Efficient mining of XML query patterns for caching. In Proceedings of the 29th Interational Very Large Data Bases (VLDB) Conference, Berlin, Germay, 2003, 69-80.
[41] Yoshida, I.K., Motoda, H. Idurkhya, N.: Graph-based induction as a unified learning framework, In Applied Intelligence 4, Kluwer Academic Publishers, Dordrecht, Netherlads, 1994, 297-328.
[42] Xiao, Y., Yao, J.-F., Li, Z., Duham, M.H.: Efficient data mining for maximal frequent subtrees. In Proceedings of the 3rd IEEE Interational Conference on Data Mining (ICDM 2003), Melbourne, Florida, USA, 2003, 379-386.
[43] Zhang, J., Ling, T. W., Brucker, R.M., Tjoa, A.M., Liu, H.: On efficient and effective association rule mining from XML data. In Proceedings of the 15th Interational Conference on Database and Expert Systems Applications (DEXA 2004), Zaragoza, Spain, 2004, 497 - 507.
[44] Zhang, S., Zhang, J., Liu, H., Wang, W.: XAR-miner: efficient association rules mining for XML data. In Proceedings of the 40th Interational World Wide Web Conference (Special interest tracks and posters), Chiba, Japa, 2005, 894-895.
[45] Zaki, M.J.: Fast Vertical Mining Using Diffsets. In Proceedings of the 9th ACM SIGKDD Interational Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 2003.
[46] Zaki, M.J.: Efficiently mining frequent trees in a forest: algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, 2005, 1021-1035

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-article-BUS8-0027-0021