Article title

Difficulty factors and preprocessing in imbalanced data sets: an experimental study on artificial data

Identifiers
Title variants
Publication languages
EN
Abstracts
EN
In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers, applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically control factors such as dimensionality, class imbalance ratio and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the last factor was the most critical one and that it exacerbated the other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors, denoted 1NN and 3NN respectively) and by SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.
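The typology of minority examples mentioned in the abstract (safe, borderline, rare, outlier) follows Napierala and Stefanowski [25], who assign each minority example a type based on the class composition of its 5-nearest-neighbourhood. A minimal plain-Python sketch of that labelling, using the usual thresholds (4 or 5 same-class neighbours: safe; 2 or 3: borderline; 1: rare; 0: outlier); the toy dataset below is an invented illustration, not data from the study:

```python
import math

def nearest_neighbors(points, idx, k=5):
    # Indices of the k nearest points to points[idx] (Euclidean), excluding itself.
    dists = sorted(
        (math.dist(points[idx], p), j)
        for j, p in enumerate(points)
        if j != idx
    )
    return [j for _, j in dists[:k]]

def example_type(points, labels, idx, minority=1, k=5):
    # Label a minority example by how many of its k neighbours share its class.
    same = sum(1 for j in nearest_neighbors(points, idx, k)
               if labels[j] == minority)
    if same >= 4:
        return "safe"
    if same >= 2:
        return "borderline"
    if same == 1:
        return "rare"
    return "outlier"

# Tiny illustration: a compact minority cluster and one isolated minority point.
pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.2, 0.1),  # minority cluster
       (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1), (4.9, 5), (5, 4.9),        # majority cluster
       (5.05, 5.05)]                                                       # isolated minority
labs = [1] * 6 + [0] * 6 + [1]
print(example_type(pts, labs, 0))    # inside the minority cluster -> "safe"
print(example_type(pts, labs, 12))   # surrounded by the majority -> "outlier"
```

In the paper's experimental setup, the controlled quantity is the proportion of each of these four types within the artificially generated minority class.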
Year
Pages
149–176
Physical description
Bibliography: 33 items, figures, tables.
Contributors
  • Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznań, Poland
author
  • Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznań, Poland
Bibliography
  • [1] Bak B.A., Jensen J.L.: High dimensional classifiers in the imbalanced case, Computational Statistics and Data Analysis, 2016, 98, 46–59.
  • [2] Batista G., Silva D., Prati R.: An experimental design to evaluate class imbalance treatment methods, in: Proc. of ICMLA’12 (Vol. 2), IEEE, 2012, 95-101.
  • [3] Caruana R., Karampatziakis N., Yessenalina A.: An empirical evaluation of supervised learning in high dimensions, in: Proc. of the 25th International Conference on Machine Learning (ICML 2008), 2008, 96–103.
  • [4] Chawla N., Bowyer K., Hall L., Kegelmeyer W.: SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 2002, 341–378.
  • [5] Demšar J.: Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research, 7, 2006, 1–30.
  • [6] Dittman D.J., Khoshgoftaar T.M., Napolitano A.: Selecting the appropriate data sampling approach for imbalanced and high-dimensional bioinformatics datasets, in: Proc. of the IEEE 14th International Conference on Bioinformatics and Bioengineering (BIBE 2014), 2014, 304–310.
  • [7] Drummond C., Holte R.: Severe class imbalance: Why better algorithms aren’t the answer, in: Proc. of the 16th European Conference on Machine Learning (ECML 2005), Springer, 2005, 539–546.
  • [8] Fernández A., López V., Galar M., Del Jesus M.J., Herrera F.: Analyzing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches, Knowledge-Based Systems, 2013, 42, 97–110.
  • [9] García V., Sánchez J., Mollineda R.: An empirical study of the behavior of classifiers on imbalanced and overlapped data sets, in: Proc. of the 12th Iberoamerican Conference on Progress in Pattern Recognition, Image Analysis and Applications, Springer, 2007, 397–406.
  • [10] García V., Sánchez J., Mollineda R.: On the k-NN performance in a challenging scenario of imbalance and overlapping, Pattern Analysis and Applications, 11, 3-4, 2008, 269–280.
  • [11] García V., Sánchez J., Mollineda R.: On the effectiveness of preprocessing methods when dealing with different levels of class imbalance, Knowledge-Based Systems, 23, 1, 2012, 13–21.
  • [12] He H., Ma Y.: Imbalanced Learning: Foundations, Algorithms and Applications, Wiley, 2013.
  • [13] Van Hulse J., Khoshgoftaar T.M., Napolitano A.: Experimental perspectives on learning from imbalanced data, in: Proc. of the 24th International Conference on Machine Learning (ICML 2007), 2007, 17–23.
  • [14] Japkowicz N., Stephen S.: The class imbalance problem: A systematic study, Intelligent Data Analysis 6, 5, 2002, 429–449.
  • [15] Japkowicz N.: Class imbalance: Are we focusing on the right issue, in: Proc. of the 2nd Workshop on Learning from Imbalanced Data Sets, ICML 2003, 2003, 17–23.
  • [16] Jo T., Japkowicz N.: Class imbalances versus small disjuncts, ACM Sigkdd Explorations Newsletter 6, 1, 2004, 40–49.
  • [17] Kang P., Cho S.: EUS SVMs: ensemble of under-sampled SVMs for data imbalance problems, in: Proc. of the 13th International Conference on Neural Information Processing (ICONIP). Springer, 2006, 837–846.
  • [18] Krawczyk B.: Learning from imbalanced data: open challenges and future directions, Progress in Artificial Intelligence, 2016, 5 (4), 221–232.
  • [19] Kubat M., Matwin S.: Addressing the curse of imbalanced training sets: one-sided selection, in: Proc. of the 14th International Conference on Machine Learning (ICML 1997), 1997, 179–186.
  • [20] Laurikkala J.: Improving identification of difficult small classes by balancing class distribution, in: Proc. of the 8th Conference on Artificial Intelligence in Medicine (AIME 2001). LNCS 2101, Springer, 2001, 63–66.
  • [21] López V., Fernández A., García S., Palade V., Herrera F.: An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Information Sciences, 2013, 250, 113–141.
  • [22] Maaranen H., Miettinen K., Mäkelä M.M.: Quasi-random initial population for genetic algorithms, Computers and Mathematics with Applications, 47, 12, 2004, 1885–1895.
  • [23] Maciá M., Bernadó-Mansilla E., Orriols-Puig A.: On the dimensions of data complexity through synthetic data sets, in: Proc. of the 11th International Conference of the Catalan Association for Artificial Intelligence, IOS Press, 2008, 244–252.
  • [24] Napierala K., Stefanowski J., Wilk S.: Learning from imbalanced data in presence of noisy and borderline examples, in: Proc. of the 7th International Conference on Rough Sets and Current Trends in Computing (RSCTC 2010). LNAI 6086, Springer, 2010, 158–167.
  • [25] Napierala K., Stefanowski J.: Types of minority class examples and their influence on learning classifiers from imbalanced data, Journal of Intelligent Information Systems, 2016, 46, 3, 563–597.
  • [26] Sáez J.A., Krawczyk B., Wozniak M.: Analyzing the oversampling of different classes and types of examples in multi-class imbalanced data sets, Pattern Recognition, 57, 2016, 164–178.
  • [27] Staelin C.: Parameter selection for support vector machines, Technical Report HPL-2002-354 (R.1). HP Laboratories, Israel, 2003.
  • [28] Tang Y., Zhang Y.-Q., Chawla N., Krasser S.: SVMs modeling for highly imbalanced classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B, 39, 1, 2009, 281–288.
  • [29] Tomašev N., Mladenić D.: Class imbalance and the curse of minority hubs, Knowledge-Based Systems, 2013, 53, 157–172.
  • [30] Triguero I., del Río S., López V., Bacardit J., Benítez J., Herrera F.: ROSEFW-RF: The winner algorithm for the ECBDL’14 big data competition: An extremely imbalanced big data bioinformatics problem, Knowledge-Based Systems, 2014, 87, 69–79.
  • [31] Wah Y.B., Abd Rahman H.A., He H., Bulgiba A.: Handling imbalanced dataset using SVM and k-NN approach, in: AIP Conference Proceedings, 2016, 1750 (1), 020023.
  • [32] Wilk S., Stefanowski J., Wojciechowski S., Farion K., Michalowski W.: Application of preprocessing methods to imbalanced clinical data: An experimental study, in: Proc. of the 5th International Conference on Information Technologies in Biomedicine (ITiB 2016), Vol. 1, Springer, 2016, 503–515.
  • [33] Xie T., Yu H., Wilamowski B.: Comparison between traditional neural networks and radial basis function networks, in: 2011 IEEE International Symposium on Industrial Electronics. IEEE, 2011, 1194-1199.
Notes
Prepared with funds of the Ministry of Science and Higher Education (MNiSW) under agreement 812/P-DUN/2016 for activities popularizing science (2017 tasks).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-61f66631-7634-4a8d-9fc2-71e41ab6ce60