Efficient Search Methods for Statistical Dependency Rules

Hämäläinen, W.

Powiadomienia systemowe

Sesja wygasła!
Sesja wygasła!

Artykuł - szczegóły

Tytuł artykułu

Efficient Search Methods for Statistical Dependency Rules

Autorzy

Hämäläinen W.

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Identyfikatory

Warianty tytułu

Języki publikacji

Abstrakty

Dependency analysis is one of the central problems in bioinformatics and all empirical science. In genetics, for example, an important problem is to find which gene alleles are mutually dependent or which alleles and diseases are dependent. In ecology, a similar problem is to find dependencies between different species or groups of species. In both cases a classical solution is to consider all pairwise dependencies between single attributes and evaluate the relationships with some statistical measure like the χ2-measure. It is known that the actual dependency structures can involve more attributes, but the existing computational methods are too inefficient for such an exhaustive search. In this paper, we introduce efficient search methods for positive dependencies of the form X → A with typical statistical measures. The efficiency is achieved by a special kind of a branch-andbound search which also prunes out redundant rules. Redundant attributes are especially harmful in dependency analysis, because they can blur the actual dependencies and even lead to erroneous conclusions. We consider two alternative definitions of redundancy: the classical one and a stricter one. We improve our previous algorithm for searching for the best strictly non-redundant dependency rules and introduce a totally new algorithm for searching for the best classically non-redundant rules. According to our experiments, both algorithms can prune the search space very efficiently, and in practice no minimum frequency thresholds are needed. This is an important benefit, because biological data sets are typically dense, and the alternative search methods would require too large minimum frequency thresholds for any practical purpose.

Słowa kluczowe

statistical dependence redundancy Z-score search algorithms

Wydawca

IOS Press

Czasopismo

Fundamenta Informaticae

Rocznik

2011

Tom

Vol. 113, nr 2

Strony

117--150

Opis fizyczny

Bibliogr. 26 poz., tab., wykr.

Twórcy

autor

Hämäläinen W.

Department of Biosciences, University of Eastern Finland, Kuopio, Finland, wilhelmiina.hamalainen@uef.fi

Bibliografia

[1] Aggarwal, C., Yu, P.: A New Framework For Itemset Generation, Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 1998), ACMPress, 1998.
[2] Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, ACM Press, 1993.
[3] Agrawal, R., Srikant, R.: Fast algorithms for mining association rules, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94),Morgan Kaufmann, 1994.
[4] Antonie, M.-L., Za¨ıane, O. R.: Mining positive and negative association rules: an approach for confined rules, Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'04), Springer-Verlag, 2004.
[5] Becquet, C., Blachon, S., Jeudy, B., Boulicaut, J.-F., Gandrillon, O.: Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data, Genome Biology, 3, 2002, 1-16.
[6] Blanchard, J., Guillet, F., Gras, R., Briand, H.: Using Information-Theoretic Measures to Assess Association Rule Interestingness, Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), IEEE Computer Society, 2005.
[7] Borgelt, C., Kruse, R.: Induction of Association Rules: Apriori Implementation, Proceedings of the 15th Conference on Computational Statistics (COMPSTAT 2002), Physica Verlag, Heidelberg, Germany, 2002.
[8] Dale, M., Blundon, D., MacIsaac, D., Thomas, A.: Multiple species effects and spatial autocorrelation in detecting species associations, Journal of Vegetation Science, 2, 1991, 635-642.
[9] Frank, A., Asuncion, A.: UCI Machine Learning Repository, 2010, http://archive.ics.uci.edu/ml.
[10] Hämäläinen,W.: Lift-based search for significant dependencies in dense data sets, Proceedings of the Workshop on Statistical and Relational Learning in Bioinformatics (StReBio'09), in the 15th ACM SIGKDD conference on Knowledge Discovery and Data Mining (KDD'09), ACM Press, 2009.
[11] Hämäläinen, W.: StatApriori: an efficient algorithm for searching statistically significant association rules, Knowledge and Information Systems: An International Journal (KAIS), 23(3), June 2010, 373-399.
[12] Hämäläinen, W., Nykänen, M.: Efficient discovery of statistically significant association rules, Proceedings of the 8th IEEE International Conference on Data Mining (ICDM'08), 2008.
[13] Koh, Y., Pears, R.: Efficiently Finding Negative Association Rules Without Support Threshold, Advances in Artificial Intelligence, Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AI 2007), 4830, Springer, 2007.
[14] Kwiatkowski, D.: How Malaria Has Affected the Human Genome andWhat Human Genetics Can Teach Us about Malaria, The American Journal of Human Genetics, 77, 2005, 171-192.
[15] Li, J.: On Optimal Rule Discovery, IEEE Transactions on Knowledge and Data Engineering, 18(4), 2006, 460-471.
[16] Liu, F., Wang, W., Zhang, M., Zheng, J., Wang, Z., Zhang, S., Yang, W., An, S.: Species association in tropical montane rain forest at two successional stages in Diaoluo Mountain, Hainan, Frontiers of Forestry in China, 3, 2008, 308-314.
[17] Mannila, H., Toivonen, H., Verkamo, A.: Efficient algorithms for discovering association rules, Papers from the AAAI Workshop on Knowledge Discovery in Databases (KDD'94), AAAI Press, 1994.
[18] Mei, H., Cuccaro, M., Martin, E.: Multifactor Dimensionality Reduction-Phenomics: A Novel Method to Capture Genetic Heterogeneity with Use of Phenotypic Variables, American Jounal of Human Genetics, 81, 2007, 1251-61.
[19] Moore, J., Asselbergs, F., Williams, S. M.: Bioinformatics challenges for genome-wide association studies, Bioinformatics, 26(4), 2010, 445-455.
[20] Morishita, S., Nakaya, A.: Parallel branch-and-bound braph bearch for borrelated bssociation bules, Large-Scale Parallel Data Mining, Revised Papers from the Workshop on Large-Scale Parallel KDD Systems, in the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'00), 1759, Springer-Verlag, 2000.
[21] Morishita, S., Sese, J.: Transversing itemset lattices with statistical metric pruning, Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'00), ACM Press, 2000.
[22] Nijssen, S., Guns, T., Raedt, L. D.: Correlated itemset mining in ROC space: a constraint programming approach, Proceedings the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'09), ACM Press, 2009.
[23] Nijssen, S., Kok, J.: Multi-class Correlated Pattern Mining, Proceedings of the 4th International Workshop on Knowledge Discovery in Inductive Databases, 3933, Springer, 2006.
[24] Webb, G.: Discovering significant rules, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), ACM Press, 2006.
[25] Webb, G.: Discovering Significant Patterns, Machine Learning, 68(1), 2007, 1-33.
[26] Webb, G., Zhang, S.: K-Optimal Rule Discovery, Data Mining and Knowledge Discovery, 10(1), 2005, 39-79.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-article-BUS8-0022-0070