Novel approach for big data classification based on hybrid parallel dimensionality reduction using spark cluster

Ali, Ahmed Hussein; Abdullah, Mahmood Zaki

doi:10.7494/csci.2019.20.4.3307

Abstract

The big data concept has prompted studies on how to extract valuable information accurately and efficiently from very large datasets. A major problem in big data mining is high dimensionality, that is, the large number of features in such datasets. High dimensionality degrades the accuracy of machine learning (ML) classifiers and wastes computation time because many of the features are redundant. A fast feature reduction method can potentially solve this problem. This study therefore presents fast HP-PL, a new hybrid parallel feature reduction framework that uses Spark to perform feature reduction on shared/distributed-memory clusters. In an evaluation on the KDD99 dataset, the proposed HP-PL was significantly faster than conventional feature reduction techniques. The proposed technique required >1 minute to select 4 features from over 79 features and 3,000,000 samples on a 3-node cluster (21 cores in total), whereas the comparative algorithm required more than 2 hours for the same task. In the proposed system, the Hadoop Distributed File System (HDFS) provides distributed storage and Apache Spark serves as the computing engine. The model was developed as a parallel model that takes full account of the high performance and throughput of distributed computing. In conclusion, the proposed HP-PL method achieves good accuracy with less memory and time than conventional feature reduction methods. The tool is publicly available at https://github.com/ahmed/Fast-HP-PL.
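The abstract does not describe the internals of HP-PL, so the following is only a minimal sketch of one idea that partition-parallel feature reduction commonly relies on, not the authors' implementation. It assumes a PCA-style step that needs a covariance matrix, and shows how that matrix can be assembled from per-partition statistics that are computed independently and then merged. The function names (`partition_stats`, `merge_stats`, `covariance`) and the toy data are hypothetical, and plain Python lists stand in for cluster partitions.

```python
from functools import reduce


def partition_stats(rows):
    """Map step: row count, per-feature sums, and sum of outer products for one partition."""
    d = len(rows[0])
    n = 0
    s = [0.0] * d
    ss = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for i in range(d):
            s[i] += x[i]
            for j in range(d):
                ss[i][j] += x[i] * x[j]
    return n, s, ss


def merge_stats(a, b):
    """Reduce step: the statistics are additive, so two partitions merge by summation."""
    n = a[0] + b[0]
    s = [u + v for u, v in zip(a[1], b[1])]
    ss = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, s, ss


def covariance(stats):
    """Sample covariance from merged statistics: (SS - n * mean * mean^T) / (n - 1)."""
    n, s, ss = stats
    d = len(s)
    mean = [v / n for v in s]
    return [
        [(ss[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


# Hypothetical toy data: four samples with two features, split into two "partitions".
partitions = [
    [[1.0, 2.0], [2.0, 4.0]],
    [[3.0, 6.0], [4.0, 8.0]],
]

merged = reduce(merge_stats, map(partition_stats, partitions))
cov = covariance(merged)
```

Because each partition is summarized by a fixed-size tuple, only d-by-d statistics need to be exchanged between workers rather than the raw samples, which is one reason this pattern suits datasets with millions of rows and a modest number of features.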

Keywords

big data, dimensionality reduction, parallel processing, Spark, PCA, LDA

Publisher

Wydawnictwa AGH

Journal

Computer Science

Year

2019

Volume

Vol. 20 (4)

Pages

411–429

Physical description

Bibliography: 30 items, figures, tables.

Authors

author

Ali Ahmed Hussein

ICCI, Informatics Institute for Postgraduate Studies, Baghdad, Iraq

author

Abdullah Mahmood Zaki

Computer Eng., College of Engineering, Mustansiriyah University, Baghdad, Iraq


Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-91ab34de-48b8-438a-968d-eb8d6e63a91d