Learning from Skewed Class Multi-relational Databases

Guo, H.; Viktor, H.L.

Article - details

Article title

Learning from Skewed Class Multi-relational Databases

Authors

Guo H., Viktor H.L.

Selected full texts from this journal

https://fi.episciences.org/

Abstracts

Relational databases, with vast amounts of data (from financial transactions, marketing surveys, and medical records to health informatics observations) and complex schemas, are ubiquitous in our society. Multirelational classification algorithms have been proposed to learn from such relational repositories, where multiple interconnected tables (relations) are involved. These methods search for relevant features both in a target relation (in which each tuple is associated with a class label) and in relations related to the target, in order to better classify target relation tuples. However, in many practical database applications, such as credit card fraud detection and disease diagnosis, the target tuples are highly imbalanced. That is, the number of examples of one class (the majority class) in the target relation is much higher than that of the others (the minority classes). Many existing methods thus tend to produce poor predictive performance on the underrepresented class in the data. This paper presents a strategy to deal with such imbalanced multirelational data. The method learns from multiple views (feature sets) of relational data in order to construct view learners with different awareness of the imbalance problem. The different observations held by the multiple view learners are then combined to yield a model that has better knowledge of both the majority and minority classes in a relational database. Experiments performed on six benchmark data sets show that the proposed method achieves promising results when compared with other popular relational data mining algorithms, in terms of the ROC curve and AUC value obtained. In particular, an important result indicates that the method is superior when the class imbalance is very high.

Keywords

multi-relational data mining, classification, multi-view learning, relational databases, imbalanced classes, ensemble

Publisher

IOS Press

Journal

Fundamenta Informaticae

Year

2008

Volume

Vol. 89, no. 1

Pages

69-94

Physical description

61 references, tables, charts

Contributors

author

Guo H.

author

Viktor H.L.

School of Information Technology and Engineering, University of Ottawa, Canada, hguo028@site.uottawa.ca

Bibliography

[1] Berka, P.: Guide to the Financial Data Set, A. Siebes and P. Berka, editors, PKDD2000 Discovery Challenge, 2000.
[2] Blockeel, H., Raedt, L. D.: Top-Down Induction of First-Order Logical Decision Trees, Artificial Intelligence, 101(1-2), 1998, 285-297.
[3] Blum, A., Mitchell, T.: Combining Labeled and Unlabeled Data with Co-training, Proceedings of the Workshop on Computational Learning Theory, 1998.
[4] Breiman, L.: Bagging Predictors, Machine Learning, 24(2), 1996, 123-140.
[5] Breiman, L.: Random Forests, Machine Learning, 45(1), 2001, 5-32, ISSN 0885-6125.
[6] Bryll, R., Gutierrez Osuna, R., Quek, F.: Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets, Pattern Recognition, 36(6), June 2003, 1291-1302.
[7] Cardie, C., Howe, N.: Improving Minority Class Prediction Using Case-Specific Feature Weights, ICML '97: Proceedings of the 14th International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997, ISBN 1-55860-486-3.
[8] Ceci, M., Appice, A., Malerba, D.: Mr-SBC: A Multi-relational Naïve Bayes Classifier, PKDD '03: Proceedings of the 7th European Conference on Principles of Data Mining and Knowledge Discovery, 2003.
[9] Ceci, M., Berardi, M., Malerba, D.: Relational data mining and ILP for document image processing, Applied Artificial Intelligence, 21(8), 2007, 317-342.
[10] Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P.: SMOTE: Synthetic Minority Over-sampling Technique, Journal of Artificial Intelligence Research, 16, 2002, 321-357.
[11] Collins, M., Singer, Y.: Unsupervised models for named entity classification, Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[12] Dasgupta, S., Littman, M. L., McAllester, D. A.: PAC Generalization Bounds for Co-training, Proceedings of NIPS '01, Neural Information Processing Systems, 2001.
[13] de Sa, V. R.: Learning Classification with Unlabeled Data, Proceedings of NIPS'93, Neural Information Processing Systems (J. D. Cowan, G. Tesauro, J. Alspector, Eds.), San Francisco, CA, 1993.
[14] Dietterich, T. G.: Machine-Learning Research: Four Current Directions, The AI Magazine, 18(4), 1998, 97-136.
[15] Domingos, P.: MetaCost: A General Method for Making Classifiers Cost-Sensitive, KDD '99: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
[16] Dzeroski, S., Raedt, L. D.: Multi-relational data mining: an introduction, SIGKDD Explor. Newsl., 5(1), 2003, 1-16.
[17] Elmasri, R., Navathe, S. B.: Fundamentals of Database Systems, CA, USA, 1989, ISBN 0-8053-0145-3.
[18] Fawcett, T.: An introduction to ROC analysis, Pattern Recogn. Lett., 27(8), 2006, 861-874, ISSN 0167-8655.
[19] Freund, Y., Schapire, R. E.: Experiments with a New Boosting Algorithm, ICML '96: Proceedings of the 13th International Conference on Machine Learning, 1996.
[20] Garcia-Molina, H., Ullman, J., Widom, J.: Database Systems: The Complete Book, Prentice Hall, 2002.
[21] Getoor, L., Friedman, N., Koller, D., Taskar, B.: Learning Probabilistic Models of Relational Structure, ICML '01: Proceedings of the 18th International Conference on Machine Learning, 2001.
[22] Ghani, R.: Combining Labeled and Unlabeled Data for MultiClass Text Categorization, ICML '02: Proceedings of the 19th International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002, ISBN 1-55860-873-7.
[23] Guo, H., Viktor, H. L.: Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach, SIGKDD Explor. Newsl., 6(1), 2004, 30-39, ISSN 1931-0145.
[24] Guo, H., Viktor, H. L.: Mining relational databases with multi-view learning, MRDM '05: Proceedings of the 4th International Workshop on Multi-relational Mining, ACM Press, 2005, ISBN 1-59593-212-7.
[25] Guo, H., Viktor, H. L.: Mining Imbalanced Classes in Multirelational Classification, MRDM '07: Proceedings of the 6th Multi-Relational Data Mining Workshop, Editors D. Malerba, A. Appice, and M. Ceci, 2007.
[26] James, G.: Majority Vote Classifiers: Theory and Applications, Ph.D. Thesis, Stanford University, 1998.
[27] Japkowicz, N.: Learning from imbalanced data sets: a comparison of various strategies, AAAI Workshop on Learning from Imbalanced Data Sets, Tech. rep. WS-00-05, Menlo Park, CA: AAAI Press, 2000.
[28] John, G. H., Langley, P.: Estimating Continuous Distributions in Bayesian Classifiers, UAI '95: Proceedings of the 11th Conference on Uncertainty in AI, 1995.
[29] Kietz, J.-U., Zücker, R., Vaduva, A.: MINING MART: Combining Case-Based-Reasoning and Multistrategy Learning into a Framework for Reusing KDD-Applications, 5th International Workshop on Multistrategy Learning (MSL 2000), Guimaraes, Portugal, 2000.
[30] Kiritchenko, S., Matwin, S.: Email classification with co-training, CASCON '01: Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research, IBM Press, 2001.
[31] Knobbe, A. J.: Multi-Relational Data Mining, Ph.D. Thesis, University Utrecht, 2004.
[32] Kockelkorn, M., Lüneburg, A., Scheffer, T.: Using Transduction and Multi-view Learning to Answer Emails, PKDD '03: Proceedings of the 7th European Conference on Principles of Data Mining and Knowledge Discovery, 2003.
[33] Krogel, M.-A.: On Propositionalization for Knowledge Discovery in Relational Databases, Ph.D. Thesis, The Faculty of Computer Science, Otto-von-Guericke-Universität Magdeburg, 2005.
[34] Krogel, M.-A., Scheffer, T.: Multi-Relational Learning, Text Mining, and Semi-Supervised Learning for Functional Genomics, Mach. Learn., 57(1-2), 2004, 61-81, ISSN 0885-6125.
[35] Krogel, M.-A., Wrobel, S.: Facets of aggregation approaches to propositionalization, Proceedings of the Work-in-Progress Track at the ILP, 2003.
[36] Kubat, M., Holte, R. C., Matwin, S.: Machine Learning for the Detection of Oil Spills in Satellite Radar Images, Machine Learning, 30(2-3), 1998, 195-215.
[37] Lazarevic, A., Kumar, V.: Feature bagging for outlier detection, KDD '05: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005.
[38] Ling, C., Sheng, V.: Cost-sensitive Learning and the Class Imbalanced Problem, Encyclopedia of Machine Learning. C. Sammut (Ed.), Springer, 2007.
[39] Ling, C. X., Li, C.: Data Mining for Direct Marketing: Problems and Solutions, KDD '98: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 1998.
[40] Ling, C. X., Sheng, V. S.: Test Strategies for Cost-Sensitive Decision Trees, IEEE Transactions on Knowledge and Data Engineering, 18(8), 2006, 1055-1067, ISSN 1041-4347.
[41] Maloof, M.: Learning when data sets are imbalanced and when costs are unequal and unknown, ICML Workshop on Learning from Imbalanced Data Sets II, 2003.
[42] Muslea, I., Minton, S., Knoblock, C.: Adaptive view validation: A first step towards automatic view detection, ICML '02: Proceedings of the 19th International Conference on Machine Learning, 2002.
[43] Muslea, I. A.: Active learning with multiple views, Ph.D. Thesis, Department of Computer Science, University of Southern California, 2002.
[44] Opitz, D., Maclin, R.: Popular Ensemble Methods: An Empirical Study, Journal of Artificial Intelligence Research, 11, 1999, 169-198.
[45] Pierce, D., Cardie, C.: Limitations of Co-Training for Natural Language Learning from Large Datasets, Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, 2001.
[46] Provost, F.: Machine Learning from Imbalanced Data Sets 101 (Extended Abstract), Proceedings of the AAAI 2000 Workshop on Imbalanced Data Sets, 2000.
[47] Provost, F., Fawcett, T.: Robust Classification for Imprecise Environments, Mach. Learn., 42(3), 2001, 203-231, ISSN 0885-6125.
[48] Qin, Z., Ling, C. X., Sheng, S.: "Missing Is Useful": Missing Values in Cost-Sensitive Decision Trees, IEEE Transactions on Knowledge and Data Engineering, 17(12), 2005, 1689-1693, ISSN 1041-4347.
[49] Quinlan, J. R.: C4.5: programs for machine learning, Morgan Kaufmann Publishers Inc., USA, 1993, ISBN 1-55860-238-0.
[50] Quinlan, J. R.: Bagging, Boosting, and C4.5, AAAI/IAAI, Vol. 1, 1996.
[51] Quinlan, J. R., Cameron-Jones, R. M.: FOIL: A Midterm Report, ECML '93: Proceedings of the European Conference on Machine Learning, 1993.
[52] Ramakrishnan, R., Gehrke, J.: Database Management Systems, McGraw-Hill Companies, 2003.
[53] Ruping, S., Scheffer, T.: Workshop on Learning with Multiple Views, Stefan Ruping and Tobias Scheffer, editors, Proceedings of the ICML Workshop on Learning with Multiple Views, 2005.
[54] de Sa, V. R., Ballard, D. H.: Category learning through multimodality sensing, Neural Computation, 10(5), 1998, 1097-1117, ISSN 0899-7667.
[55] Sen, P., Getoor, L.: Cost-sensitive learning with conditional Markov networks, ICML '06: Proceedings of the 23rd International Conference on Machine Learning, ACM Press, New York, NY, USA, 2006.
[56] Sheng, V. S., Ling, C. X.: Thresholding for Making Classifiers Cost-sensitive, AAAI '06: The 21st National Conference on Artificial Intelligence, 2006.
[57] Srinivasan, A., Muggleton, S. H., Sternberg, M. J. E., King, R. D.: Theories for mutagenicity: a study in first-order and feature-based induction, Artif. Intell., 85(1-2), 1996, 277-299, ISSN 0004-3702.
[58] Weiss, G., Provost, F.: The effect of class distribution on classifier learning, Technical Report ML-TR 43, Department of Computer Science, Rutgers University, 2001.
[59] Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, CA, USA, 2000, ISBN 1-55860-552-5.
[60] Wolpert, D. H.: Stacked Generalization, Technical Report LA-UR-90-3460, Los Alamos, NM, 1990.
[61] Yin, X., Han, J., Yang, J., Yu, P. S.: CrossMine: Efficient Classification Across Multiple Database Relations, ICDE '04: Proceedings of the 20th International Conference on Data Engineering, Boston, 2004.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BUS8-0003-0053