Article title

A Gaussian-based WGAN-GP oversampling approach for solving the class imbalance problem

Languages of publication
EN
Abstracts
EN
In practical applications of machine learning, the class distribution of the collected training set is usually imbalanced, i.e., the sizes of the different classes differ considerably. The class imbalance problem often severely limits the generalization performance achievable by most classifier learning algorithms. To improve the learning performance, several effective approaches have been proposed in the literature, among which the recently presented GAN-based oversampling methods are particularly representative. However, the minority class examples they generate risk being highly similar to one another or even duplicated. To further improve the quality of the generated minority class examples, i.e., to make them effectively expand the minority class region, a novel oversampling approach named GWGAN-GP is proposed, based on a Gaussian distribution label within the framework of the Wasserstein generative adversarial network with gradient penalty (WGAN-GP). Our GWGAN-GP approach incorporates the Gaussian distribution as an input label, thereby making the generated examples more diverse and dispersed. The generated examples are then combined with the original dataset to form a balanced dataset, which is subsequently used to evaluate the classification performance of three selected classification algorithms. Experimental results on 16 imbalanced datasets demonstrate that the GWGAN-GP not only generates examples that better conform to the distribution of the original dataset, but also achieves superior classification performance. In particular, when combined with the KNN classifier, the GWGAN-GP significantly outperforms the other oversampling approaches considered in the study.
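The record does not include the authors' implementation, so the following is only an illustrative sketch of the mechanism described in the abstract: a WGAN-GP generator and critic in which a Gaussian distribution label is concatenated with the noise input. It assumes PyTorch, and all layer sizes and names (NOISE_DIM, LABEL_DIM, FEAT_DIM, gradient_penalty, and so on) are hypothetical, not the paper's actual settings.

import torch
import torch.nn as nn

# Hypothetical dimensions; the settings used in the paper are not given in this record.
NOISE_DIM, LABEL_DIM, FEAT_DIM = 32, 8, 16

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + LABEL_DIM, 64), nn.ReLU(),
            nn.Linear(64, FEAT_DIM),
        )

    def forward(self, z, gaussian_label):
        # The Gaussian "label" is drawn from N(0, I) and concatenated with the noise,
        # which is meant to spread the generated minority examples more widely.
        return self.net(torch.cat([z, gaussian_label], dim=1))

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 64), nn.LeakyReLU(0.2),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

def gradient_penalty(critic, real, fake):
    # Standard WGAN-GP term (Gulrajani et al., 2017): penalize deviations of the
    # critic's gradient norm from 1 on random interpolations of real and fake data.
    eps = torch.rand(real.size(0), 1)
    inter = (eps * real.detach() + (1.0 - eps) * fake.detach()).requires_grad_(True)
    score = critic(inter)
    grad, = torch.autograd.grad(score, inter,
                                grad_outputs=torch.ones_like(score),
                                create_graph=True)
    return ((grad.norm(2, dim=1) - 1.0) ** 2).mean()

# Example: draw noise and a Gaussian distribution label, then generate synthetic
# minority class examples to append to the original (imbalanced) training set.
generator = Generator()
z = torch.randn(100, NOISE_DIM)
gaussian_label = torch.randn(100, LABEL_DIM)
synthetic_minority = generator(z, gaussian_label).detach()

In a training loop, the critic loss would combine the Wasserstein estimate with the penalty term (e.g., critic(fake).mean() - critic(real).mean() + 10 * gradient_penalty(critic, real, fake)), and the synthetic minority examples produced by the trained generator would be appended to the original data to balance the classes before fitting KNN or another classifier.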
Year
Pages
291-307
Physical description
Bibliography: 58 items, figures, tables, charts
Authors
author
  • Department of Computer Science and Technology, Shandong Agricultural University, 61 Daizong Street, 271018, Tai’an, Shandong, China
author
  • Department of Computer Science and Technology, Shandong Agricultural University, 61 Daizong Street, 271018, Tai’an, Shandong, China
References
  • [1] Arjovsky, M., Chintala, S. and Bottou, L. (2017). Wasserstein generative adversarial networks, International Conference on Machine Learning, Sydney, Australia, pp. 214-223.
  • [2] Barua, S., Islam, M.M. and Murase, K. (2013). PROWSYN: Proximity weighted synthetic oversampling technique for imbalanced data set learning, Advances in Knowledge Discovery and Data Mining: 17th Pacific-Asia Conference, PAKDD 2013, Gold Coast, Australia, pp. 317-328.
  • [3] Bourou, S., El Saer, A., Velivassaki, T.-H., Voulkidis, A. and Zahariadis, T. (2021). A review of tabular data synthesis using GANs on an IDS dataset, Information 12(09): 375.
  • [4] Breiman, L. (2001). Random forests, Machine Learning 45(1): 5-32.
  • [5] Breiman, L. (2017). Classification and Regression Trees, Routledge, London.
  • [6] Budach, L., Feuerpfeil, M., Ihde, N., Nathansen, A., Noack, N., Patzlaff, H., Naumann, F. and Harmouch, H. (2022). The effects of data quality on machine learning performance, arXiv: 2207.14529.
  • [7] Chaabane, I., Guermazi, R. and Hammami, M. (2020). Enhancing techniques for learning decision trees from imbalanced data, Advances in Data Analysis and Classification 14(3): 1-69.
  • [8] Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P. (2002). SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16: 321-357.
  • [9] Chen, J., Huang, H., Cohn, A.G., Zhang, D. and Zhou, M. (2022). Machine learning-based classification of rock discontinuity trace: SMOTE oversampling integrated with GBT ensemble learning, International Journal of Mining Science and Technology 32(2): 309-322.
  • [10] Chen, J., Yan, Z., Lin, C., Yao, B. and Ge, H. (2023). Aero-engine high speed bearing fault diagnosis for data imbalance: A sample enhanced diagnostic method based on pre-training WGAN-GP, Measurement 213(7): 112709.
  • [11] Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13(1): 21-27.
  • [12] Cui, J., Zong, L., Xie, J. and Tang, M. (2023). A novel multi-module integrated intrusion detection system for high-dimensional imbalanced data, Applied Intelligence 53(1): 272-288.
  • [13] Derrac, J., Garcia, S., Sanchez, L. and Herrera, F. (2015). KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic and Soft Computing 17(2-3): 255-287.
  • [14] Douzas, G. and Bacao, F. (2018). Effective data generation for imbalanced learning using conditional generative adversarial networks, Expert Systems with Applications 91(1): 464-471.
  • [15] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository, http://archive.ics.uci.edu/ml.
  • [16] Fernández, A., Garcia, S., Herrera, F. and Chawla, N.V. (2018). SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary, Journal of Artificial Intelligence Research 61: 863-905.
  • [17] Freund, Y. and Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55(1): 119-139.
  • [18] García, S., Luengo, J. and Herrera, F. (2016). Tutorial on practical tips of the most influential data preprocessing algorithms in data mining, Knowledge-Based Systems 98(7): 1-29.
  • [19] Gazzah, S. and Amara, N.E.B. (2008). New oversampling approaches based on polynomial fitting for imbalanced data sets, 2008 8th IAPR International Workshop on Document Analysis Systems, Nara, Japan, pp. 677-684.
  • [20] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets, Advances in Neural Information Processing Systems 27: 2672-2680.
  • [21] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. and Courville, A.C. (2017). Improved training of Wasserstein GANs, Advances in Neural Information Processing Systems 30: 5767-5777.
  • [22] Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection, Journal of Machine Learning Research 3(Mar): 1157-1182.
  • [23] He, H. and Garcia, E.A. (2009). Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering 21(9): 1263-1284.
  • [24] Hernandez, M., Epelde, G., Alberdi, A., Cilla, R. and Rankin, D. (2022). Synthetic data generation for tabular health records: A systematic review, Neurocomputing 493(27): 28-45.
  • [25] James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications to R, 2nd Edn, Springer, New York.
  • [26] Janicka, M., Lango, M. and Stefanowski, J. (2019). Using information on class interrelations to improve classification of multiclass imbalanced data: A new resampling algorithm, International Journal of Applied Mathematics and Computer Science 29(4): 769-781, DOI: 10.2478/amcs-2019-0057.
  • [27] Japkowicz, N. (2003). Class imbalances: Are we focusing on the right issue, Workshop on Learning from Imbalanced Data Sets II, Washington, USA, p. 63.
  • [28] Kaggle (2024), Datasets: Lower Back Pain, https://www.kaggle.com/datasets/sammy123/lower-back-pain-symptoms-dataset, and Telecom Churn, https://www.kaggle.com/datasets/mnassrib/telecom-churn-datasets.
  • [29] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection, 14th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada, pp. 1137-1145.
  • [30] Kovács, G. (2019). An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets, Applied Soft Computing 83(9): 105662.
  • [31] Liu, X.-Y., Wu, J. and Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning, IEEE Transactions on Systems, Man, and Cybernetics B: Cybernetics 39(2): 539-550.
  • [32] López, V., Fernández, A., García, S., Palade, V. and Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Information Sciences 250(33): 113-141.
  • [33] Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets, arXiv: 1411.1784.
  • [34] Miyato, T., Kataoka, T., Koyama, M. and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks, arXiv: 1802.05957.
  • [35] Moreo, A., Esuli, A. and Sebastiani, F. (2016). Distributional random oversampling for imbalanced text classification, Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, pp. 805-808.
  • [36] Napierala, K. and Stefanowski, J. (2016). Types of minority class examples and their influence on learning classifiers from imbalanced data, Journal of Intelligent Information Systems 46: 563-597.
  • [37] Nik, A.H.Z., Riegler, M.A., Halvorsen, P. and Storås, A.M. (2023). Generation of synthetic tabular healthcare data using generative adversarial networks, International Conference on Multimedia Modeling, Bergen, Norway, pp. 434-446.
  • [38] Ohsaki, M., Wang, P., Matsuda, K., Katagiri, S., Watanabe, H. and Ralescu, A. (2017). Confusion-matrix-based kernel logistic regression for imbalanced data classification, IEEE Transactions on Knowledge and Data Engineering 29(9): 1806-1819.
  • [39] Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H. and Kim, Y. (2018). Data synthesis based on generative adversarial networks, Proceedings of the VLDB Endowment 11(10): 1071-1083.
  • [40] Park, S. and Park, H. (2021). Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic, Computing 103(3): 401-424.
  • [41] Powers, D.M. (2020). Evaluation: From precision, recall and f-measure to ROC, informedness, markedness and correlation, arXiv: 2010.16061.
  • [42] Ren, J., Wang, Y., Cheung, Y.-m., Gao, X.-Z. and Guo, X. (2023). Grouping-based oversampling in kernel space for imbalanced data classification, Pattern Recognition 133(1): 108992.
  • [43] Sáez, J.A., Luengo, J., Stefanowski, J. and Herrera, F. (2015). SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Information Sciences 291(2): 184-203.
  • [44] Sun, B., Zhou, Q., Wang, Z., Lan, P., Song, Y., Mu, S., Li, A., Chen, H. and Liu, P. (2023). Radial-based undersampling approach with adaptive undersampling ratio determination, Neurocomputing 553(39): 126544.
  • [45] Sun, Y., Wong, A.K. and Kamel, M.S. (2009). Classification of imbalanced data: A review, International Journal of Pattern Recognition and Artificial Intelligence 23(04): 687-719.
  • [46] Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference, Springer, New York.
  • [47] Wold, S., Esbensen, K. and Geladi, P. (1987). Principal component analysis, Chemometrics and Intelligent Laboratory Systems 2(1-3): 37-52.
  • [48] Woods, K.S., Doss, C.C., Bowyer, K.W., Solka, J.L., Priebe, C.E. and Kegelmeyer Jr, W.P. (1993). Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography, International Journal of Pattern Recognition and Artificial Intelligence 7(06): 1417-1436.
  • [49] Xie, Y. and Zhang, T. (2018). Imbalanced learning for fault diagnosis problem of rotating machinery based on generative adversarial networks, 2018 37th Chinese Control Conference (CCC), Wuhan, China, pp. 6017-6022.
  • [50] Xu, L., Skoularidou, M., Cuesta-Infante, A. and Veeramachaneni, K. (2019). Modeling tabular data using conditional GAN, Advances in Neural Information Processing Systems 32: 7335-7345.
  • [51] Yang, C., Lu, X., Lin, Z., Shechtman, E., Wang, O. and Li, H. (2017). High-resolution image inpainting using multi-scale neural patch synthesis, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6721-6729.
  • [52] Zhang, M., Wan, X., Gang, L., Lv, X., Wu, Z. and Liu, Z. (2021). An automated driving strategy generating method based on WGAIL-DDPG, International Journal of Applied Mathematics and Computer Science 31(3): 461-470, DOI: 10.34768/amcs-2021-0031.
  • [53] Zhang, Y., Liu, Y., Wang, Y. and Yang, J. (2023). An ensemble oversampling method for imbalanced classification with prior knowledge via generative adversarial network, Chemometrics and Intelligent Laboratory Systems 235(4): 104775.
  • [54] Zhao, Y., Li, H., Bissyandé, T.F., Klein, J. and Grundy, J. (2021). On the impact of sample duplication in machine-learning-based android malware detection, ACM Transactions on Software Engineering and Methodology 30(3): 1-38.
  • [55] Zhao, Z., Kunar, A., Birke, R. and Chen, L.Y. (2021). CTAB-GAN: Effective table data synthesizing, Asian Conference on Machine Learning, pp. 97-112, (virtual).
  • [56] Zheng, M., Li, T., Zhu, R., Tang, Y., Tang, M., Lin, L. and Ma, Z. (2020a). Conditional Wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification, Information Sciences 512(7): 1009-1023.
  • [57] Zheng, W. and Zhao, H. (2020b). Cost-sensitive hierarchical classification for imbalance classes, Applied Intelligence 50(8): 2328-2338.
  • [58] Zhu, B., Pan, X., vanden Broucke, S. and Xiao, J. (2022). A GAN-based hybrid sampling method for imbalanced customer classification, Information Sciences 609(28): 1397-1411.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-f12bf0c2-f8a1-427c-b8d4-869deaf9b8c0