Complexfuzzy: novel clustering method for selecting training instances of cross-project defect prediction

Oztürk, Muhammed Maruf

doi:10.7494/csci.2021.22.1.3743

Abstract

Over the last decade, researchers have investigated the extent to which cross-project defect prediction (CPDP) offers advantages over traditional defect prediction settings. In CPDP, the training and testing data are not drawn from the same project; instead, different projects are used. Selecting suitable training data is important to the success of CPDP. This study presents a novel clustering method, complexFuzzy, for selecting the training data of CPDP. The method identifies the most defective instances, which the experimental predictors then use for training. To that end, a fuzzy-based membership is constructed on the data sets. As a result, overfitting, a crucial problem in CPDP training, is alleviated. The performance of complexFuzzy is compared with that of five counterparts on 29 data sets using four classifiers. According to the results obtained, complexFuzzy outperforms the other clustering methods in CPDP performance.
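The abstract does not spell out how the fuzzy-based membership is built, so the following is only a minimal sketch of the general idea, assuming the textbook fuzzy c-means membership formula rather than the complexFuzzy algorithm itself. The function names, the fixed cluster centers, the two-metric toy instances, and the 0.7 membership threshold are all hypothetical choices made for illustration.

```python
import math

def fuzzy_memberships(points, centers, m=2.0):
    """Textbook fuzzy c-means membership of each point in each cluster.

    m > 1 is the fuzziness parameter. This is NOT the complexFuzzy
    method; it only illustrates what a fuzzy membership looks like.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # The point coincides with one or more centers:
            # split full membership among those centers.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        row = [
            1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
            for d_j in dists
        ]
        memberships.append(row)
    return memberships

def select_instances(points, centers, cluster, threshold=0.7, m=2.0):
    """Indices of points whose membership in `cluster` reaches the threshold."""
    u = fuzzy_memberships(points, centers, m)
    return [i for i, row in enumerate(u) if row[cluster] >= threshold]

# Hypothetical instances described by two software metrics each.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (3.0, 3.0)]
# Assumed centers: cluster 0 stands in for a "defect-prone" group.
centers = [(1.0, 1.0), (5.0, 5.0)]

memberships = fuzzy_memberships(points, centers)
selected = select_instances(points, centers, cluster=0)
print(selected)
```

In this sketch the instance at (3.0, 3.0) is equidistant from both centers, so it receives a membership of 0.5 in each cluster and is left out of the selection. Filtering out such ambiguous instances is one plausible way a membership threshold could reduce noisy training data, though the paper's actual selection rule may differ.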

Keywords

cross-project defect prediction; complexFuzzy; training instance selection; fuzzy clustering

Publisher

Wydawnictwa AGH

Journal

Computer Science

Year

2021

Volume

Vol. 22 (1)

Pages

3–37

Physical description

Bibliography: 74 items, figures, tables.

Contributors

author

Oztürk Muhammed Maruf

muhammedozturk@sdu.edu.tr

Suleyman Demirel University, Department of Computer Engineering, Isparta, Turkey

Bibliography

[1] Bansal M., Agrawal C.: Critical Analysis of Object Oriented Metrics in Software Development. In: 2014 Fourth International Conference on Advanced Computing and Communication Technologies, pp. 197–201, IEEE, 2014. doi: 10.1109/ACCT.2014.106.
[2] Bishnu P.S., Bhattacherjee V.: Software Fault Prediction Using Quad Tree-Based K-Means Clustering Algorithm, IEEE Transactions on Knowledge and Data Engineering, vol. 24(6), pp. 1146–1150, 2012. doi: 10.1109/TKDE.2011.163.
[3] Cai D., Zhang C., He X.: Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining – KDD’10, p. 333, ACM Press, New York, USA, 2010. doi: 10.1145/1835804.1835848.
[4] Candela I., Bavota G., Russo B., Oliveto R.: Using Cohesion and Coupling for Software Remodularization. Is It Enough? ACM Transactions on Software Engineering and Methodology, vol. 25(3), pp. 1–28, 2016. doi: 10.1145/2928268.
[5] Canfora G., De Lucia A., Di Penta M., Oliveto R., Panichella A., Panichella S.: Multi-objective Cross-Project Defect Prediction. In: 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, pp. 252–261, IEEE, 2013. doi: 10.1109/ICST.2013.38.
[6] Chan J.Y.K., Leung A.P.: Efficient k-means++ with random projection. In: International Joint Conference on Neural Networks 2017, pp. 94–100, 2017.
[7] Chikofsky E.J., Cross J.H.: Reverse engineering and design recovery: A taxonomy, IEEE Software, vol. 7(1), pp. 13–17, 1990.
[8] Fenton N., Bieman J.: Software Metrics: A Rigorous and Practical Approach, CRC Press, 2014.
[9] Fraley C., Raftery A.E.: Model-Based Clustering, Discriminant Analysis, and Density Estimation, Journal of the American Statistical Association, vol. 97(458), pp. 611–631, 2002.
[10] Fraley C., Raftery A.E., Murphy T.B., Scrucca L.: MCLUST version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation, Technical report, vol. 597, 2012.
[11] Fritzke B.: The k-means-u* algorithm: non-local jumps and greedy retries improve k-means++ clustering, arXiv preprint arXiv:1706.09059, 2017.
[12] Fukushima T., Kamei Y., McIntosh S., Yamashita K., Ubayashi N.: An empirical study of just-in-time defect prediction using cross-project models. In: Proceedings of the 11th Working Conference on Mining Software Repositories – MSR 2014, pp. 172–181, ACM Press, New York, USA, 2014. doi: 10.1145/2597073.2597075.
[13] García H.L., Gonzalez I.M.: Self-organizing map and clustering for wastewater treatment monitoring, Engineering Applications of Artificial Intelligence, vol. 17(3), pp. 215–225, 2004.
[14] Hartigan J.A., Wong M.A.: Algorithm AS 136: A K-Means Clustering Algorithm, Journal of the Royal Statistical Society Series C (Applied Statistics), vol. 28(1), pp. 100–108, 1979.
[15] He P., Ma Y., Li B.: TDSelector: A Training Data Selection Method for Cross-Project Defect Prediction, arXiv preprint arXiv:1612.09065, 2016.
[16] He Z., Shu F., Yang Y., Li M., Wang Q.: An investigation on the feasibility of cross-project defect prediction, Automated Software Engineering, vol. 19(2), pp. 167–199, 2012.
[17] Herbold S.: Training data selection for cross-project defect prediction. In: Proceedings of the 9th International Conference on Predictive Models in Software Engineering, pp. 1–10, 2013.
[18] Herbold S., Trautsch A., Grabowski J.: A Comparative Study to Benchmark Cross-Project Defect Prediction Approaches, IEEE Transactions on Software Engineering, vol. 44(9), pp. 811–833, 2018.
[19] Hosseini S., Turhan B., Gunarathna D.: A Systematic Literature Review and Meta-Analysis on Cross Project Defect Prediction, IEEE Transactions on Software Engineering, vol. 45(2), pp. 111–147, 2019.
[20] Hosseini S., Turhan B., Mantyla M.: A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction, Information and Software Technology, vol. 95, pp. 296–312, 2018.
[21] Huang J., Li Y.F., Xie M.: An empirical analysis of data preprocessing for machine learning-based software cost estimation, Information and Software Technology, vol. 67, pp. 108–127, 2015.
[22] Ishida M., Takakura H., Okabe Y.: High-Performance Intrusion Detection Using Optigrid Clustering and Grid-Based Labelling. In: 2011 IEEE/IPSJ International Symposium on Applications and the Internet, pp. 11–19, IEEE, 2011.
[23] Jabangwe R., Borstler J., Smite D., Wohlin C.: Empirical evidence on the link between object-oriented measures and external quality attributes: a systematic literature review, Empirical Software Engineering, vol. 20(3), pp. 640–693, 2015. doi: 10.1007/s10664-013-9291-7.
[24] Jing X., Wu F., Dong X., Qi F., Xu B.: Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pp. 496–507, 2015.
[25] Kamei Y., Fukushima T., McIntosh S., Yamashita K., Ubayashi N., Hassan A.E.: Studying just-in-time defect prediction using cross-project models, Empirical Software Engineering, vol. 21(5), pp. 2072–2106, 2016.
[26] Kant S., Ansari I.A.: An improved K means clustering with Atkinson index to classify liver patient dataset, International Journal of System Assurance Engineering and Management, vol. 7(1), pp. 222–228, 2016.
[27] Kumar G.R., Mangathayaru N., Narasimha G.: An improved k-Means Clustering algorithm for Intrusion Detection using Gaussian function. In: Proceedings of The International Conference on Engineering & MIS 2015, pp. 1–7, 2015.
[28] Laradji I.H., Alshayeb M., Ghouti L.: Software defect prediction using ensemble learning on selected features, Information and Software Technology, vol. 58, pp. 388–402, 2015.
[29] Li Y., Huang Z., Wang Y., Fang B.: Evaluating Data Filter on Cross-Project Defect Prediction: Comparison and Improvements, IEEE Access, vol. 5, pp. 25646–25656, 2017. doi: 10.1109/ACCESS.2017.2771460.
[30] Liebchen G.A., Shepperd M.: Data sets and data quality in software engineering. In: Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, pp. 39–44, 2008.
[31] Limsettho N., Bennin K.E., Keung J.W., Hata H., Matsumoto K.: Cross project defect prediction using class distribution estimation and oversampling, Information and Software Technology, vol. 100, pp. 87–102, 2018.
[32] Liu C., Yang D., Xia X., Yan M., Zhang X.: A two-phase transfer learning model for cross-project defect prediction, Information and Software Technology, vol. 107, pp. 125–136, 2019. doi: 10.1016/j.infsof.2018.11.005.
[33] Ludwig S.: MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability, International Journal of Machine Learning and Cybernetics, vol. 6(6), pp. 923–934, 2015.
[34] Ma L., Gu L., Li B., Ma Y., Wang J.: An Improved K-Means Algorithm Based on Mapreduce and Grid, International Journal of Grid & Distributed Computing, vol. 8(1), pp. 189–200, 2015.
[35] Ma Y., Luo G., Zeng X., Chen A.: Transfer learning for cross-company software defect prediction, Information and Software Technology, vol. 54(3), pp. 248–256, 2012.
[36] Malhotra R., Chug A.: Application of Group Method of Data Handling model for software maintainability prediction using object oriented systems, International Journal of System Assurance Engineering and Management, vol. 5(2), pp. 165–173, 2014. doi: 10.1007/s13198-014-0227-4.
[37] Nam J., Fu W., Kim S., Menzies T., Tan L.: Heterogeneous Defect Prediction, IEEE Transactions on Software Engineering, vol. 44(9), pp. 874–896, 2018. doi: 10.1109/TSE.2017.2720603.
[38] Ni C., Liu W., Gu Q., Chen X., Chen D.: FeSCH: A Feature Selection Method Using Clusters of Hybrid-Data for Cross-Project Defect Prediction. In: 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC), vol. 1, pp. 51–56, IEEE, 2017.
[39] Ni C., Liu W.S., Chen X., Gu Q., Chen D.X., Huang Q.G.: A Cluster Based Feature Selection Method for Cross-Project Software Defect Prediction, Journal of Computer Science and Technology, vol. 32(6), pp. 1090–1107, 2017. doi: 10.1007/s11390-017-1785-0.
[40] Ozturk M.: Adapting code maintainability to bat-inspired test case prioritization. In: Proceedings – 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications, INISTA 2017, 2017. doi: 10.1109/INISTA.2017.8001134.
[41] Peters F., Menzies T., Layman L.: LACE2: Better Privacy-Preserving Data Sharing for Cross Project Defect Prediction. In: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, pp. 801–811, IEEE, 2015. doi: 10.1109/ICSE.2015.92.
[42] Peters F., Menzies T., Marcus A.: Better cross company defect prediction. In: Proceedings of the 10th Working Conference on Mining Software Repositories, pp. 409–418, 2013.
[43] Poon W.N., Bennin K.E., Huang J., Phannachitta P., Keung J.W.: Cross-Project Defect Prediction Using a Credibility Theory Based Naive Bayes Classifier. In: 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 434–441, IEEE, 2017.
[44] Porto F., Minku L., Mendes E., Simao A.: A Systematic Study of Cross-Project Defect Prediction With Meta-Learning, arXiv preprint arXiv:1802.06025, 2018.
[45] Rahman F., Posnett D., Devanbu P.: Recalling the “imprecision” of cross-project defect prediction. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, pp. 1–11, 2012.
[46] Ren S., Zhang W., Munir H., Xia L.: Dissimilarity Space Based Multi-Source Cross-Project Defect Prediction, Algorithms, vol. 12(1), p. 13, 2019. doi: 10.3390/a12010013.
[47] Rezaee M.R., Lelieveldt B.P., Reiber J.H.: A new cluster validity index for the fuzzy c-mean, Pattern Recognition Letters, vol. 19(3-4), pp. 237–246, 1998.
[48] Ryu D., Baik J.: Effective multi-objective naïve Bayes learning for cross-project defect prediction, Applied Soft Computing, vol. 49, pp. 1062–1077, 2016. doi: 10.1016/j.asoc.2016.04.009.
[49] Ryu D., Jang J.I., Baik J.: A transfer cost-sensitive boosting approach for cross-project defect prediction, Software Quality Journal, vol. 25(1), pp. 235–272, 2017.
[50] Salvador S., Chan P.: Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In: 16th IEEE International Conference on Tools with Artificial Intelligence, pp. 576–584, IEEE, 2004.
[51] Sheikholeslami G., Chatterjee S., Zhang A.: WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. In: VLDB, vol. 98, pp. 428–439, 1998.
[52] Shepperd M., Bowes D., Hall T.: Researcher Bias: The Use of Machine Learning in Software Defect Prediction, IEEE Transactions on Software Engineering, vol. 40(6), pp. 603–616, 2014.
[53] Shukla S., Radhakrishnan T., Muthukumaran K., Neti L.B.M.: Multi-objective cross-version defect prediction, Soft Computing, vol. 22(6), pp. 1959–1980, 2018.
[54] Suzuki R., Shimodaira H.: Pvclust: an R package for assessing the uncertainty in hierarchical clustering, Bioinformatics, vol. 22(12), pp. 1540–1542, 2006.
[55] Turhan B., Menzies T., Bener A.B., Di Stefano J.: On the relative value of cross-company and within-company data for defect prediction, Empirical Software Engineering, vol. 14(5), pp. 540–578, 2009. doi: 10.1007/s10664-008-9103-7.
[56] Turver R.J., Munro M.: An early impact analysis technique for software maintenance, Journal of Software Maintenance: Research and Practice, vol. 6(1), pp. 35–52, 1994.
[57] Vesanto J., Alhoniemi E.: Clustering of the self-organizing map, IEEE Transactions on Neural Networks, vol. 11(3), pp. 586–600, 2000.
[58] Wang X., Wang Y., Wang L.: Improving fuzzy c-means clustering based on feature-weight learning, Pattern Recognition Letters, vol. 25(10), pp. 1123–1132, 2004.
[59] Wu F., Jing X.Y., Dong X., Cao J., Xu M., Zhang H., Ying S., Xu B.: Cross-project and within-project semi-supervised software defect prediction problems study using a unified solution. In: 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), pp. 195–197, IEEE, 2017.
[60] Xia X., Lo D., Pan S.J., Nagappan N., Wang X.: HYDRA: Massively Compositional Model for Cross-Project Defect Prediction, IEEE Transactions on Software Engineering, vol. 42(10), pp. 977–998, 2016.
[61] Xu Y., Qu W., Li Z., Ji C., Li Y., Wu Y.: Fast Scalable k-means++ Algorithm with MapReduce. In: International Conference on Algorithms and Architectures for Parallel Processing, pp. 15–28, Springer, 2014.
[62] Xu Y., Qu W., Li Z., Min G., Li K., Liu Z.: Efficient k-Means++ Approximation with MapReduce, IEEE Transactions on Parallel and Distributed Systems, vol. 25(12), pp. 3135–3144, 2014.
[63] Xu Z., Yuan P., Zhang T., Tang Y., Li S., Xia Z.: HDA: Cross-Project Defect Prediction via Heterogeneous Domain Adaptation with Dictionary Learning, IEEE Access, vol. 6, pp. 57597–57613, 2018. doi: 10.1109/ACCESS.2018.2873755.
[64] Yoon K.A., Kwon O.S., Bae D.H.: An Approach to Outlier Detection of Software Measurement Data using the K-means Clustering Method. In: First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007), pp. 443–445, IEEE, 2007. doi: 10.1109/ESEM.2007.49.
[65] Yu Q., Jiang S., Qian J.: Which Is More Important for Cross-Project Defect Prediction: Instance or Feature? In: 2016 International Conference on Software Analysis, Testing and Evolution (SATE), pp. 90–95, IEEE, 2016.
[66] Yu Q., Jiang S., Zhang Y.: A feature matching and transfer approach for cross-company defect prediction, Journal of Systems and Software, vol. 132, pp. 366–378, 2017.
[67] Yu S.S., Chu S.W., Wang C.M., Chan Y.K., Chang T.C.: Two improved k-means algorithms, Applied Soft Computing, vol. 68, pp. 747–755, 2018.
[68] Yuan X., Khoshgoftaar T.M., Allen E.B., Ganesan K.: An application of fuzzy clustering to software quality prediction. In: Proceedings 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology, pp. 85–90, IEEE, 2000. doi: 10.1109/ASSET.2000.888052.
[69] Zhang F., Keivanloo I., Zou Y.: Data Transformation in Cross-Project Defect Prediction, Empirical Software Engineering, vol. 22(6), pp. 3186–3218, 2017.
[70] Zhang F., Zheng Q., Zou Y., Hassan A.E.: Cross-Project Defect Prediction Using a Connectivity-Based Unsupervised Classifier. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp. 309–320, IEEE, 2016.
[71] Zhang Y., Lo D., Xia X., Sun J.: Combined classifier for cross-project defect prediction: an extended empirical study, Frontiers of Computer Science, vol. 12(2), pp. 280–296, 2018.
[72] Zhou K., Fu C., Yang S.: Fuzziness parameter selection in fuzzy c-means: The perspective of cluster validation, Science China Information Sciences, vol. 57, pp. 1–8, 2014. doi: 10.1007/s11432-014-5146-0.
[73] Zhou Y., Yang Y., Lu H., Chen L., Li Y., Zhao Y., Qian J., Xu B.: How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction, ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 27(1), pp. 1–51, 2018.
[74] Zimmermann T., Nagappan N., Gall H., Giger E., Murphy B.: Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pp. 91–100, 2009.

Notes

Record prepared with funding from MNiSW (the Polish Ministry of Science and Higher Education), agreement No. 461252, under the "Social Responsibility of Science" programme, module: Popularisation of science and promotion of sport (2021).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-bdf97a7b-8cb3-4aac-9334-ca4dc1c2b15b