A Detailed Study of the Distributed Rough Set Based Locality Sensitive Hashing Feature Selection Technique.

Chelly Dagdia, Zaineb; Zarges, Christine

doi:10.3233/FI-2021-2069

Artykuł - szczegóły

Tytuł artykułu

A Detailed Study of the Distributed Rough Set Based Locality Sensitive Hashing Feature Selection Technique.

Autorzy

Chelly Dagdia Zaineb , Zarges Christine

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Identyfikatory

DOI

10.3233/FI-2021-2069

Warianty tytułu

Języki publikacji

Abstrakty

In the context of big data, granular computing has recently been implemented by some mathematical tools, especially Rough Set Theory (RST). As a key topic of rough set theory, feature selection has been investigated to adapt the related granular concepts of RST to deal with large amounts of data, leading to the development of the distributed RST version. However, despite of its scalability, the distributed RST version faces a key challenge tied to the partitioning of the feature search space in the distributed environment while guaranteeing data dependency. Therefore, in this manuscript, we propose a new distributed RST version based on Locality Sensitive Hashing (LSH), named LSH-dRST, for big data feature selection. LSH-dRST uses LSH to match similar features into the same bucket and maps the generated buckets into partitions to enable the splitting of the universe in a more efficient way. More precisely, in this paper, we perform a detailed analysis of the performance of LSH-dRST by comparing it to the standard distributed RST version, which is based on a random partitioning of the universe. We demonstrate that our LSH-dRST is scalable when dealing with large amounts of data. We also demonstrate that LSH-dRST ensures the partitioning of the high dimensional feature search space in a more reliable way; hence better preserving data dependency in the distributed environment and ensuring a lower computational cost.

Słowa kluczowe

Granular Computing Rough Set Theory Big Data; Feature Selection Locality Sensitive Hashing Distributed Processing.

Wydawca

IOS Press

Czasopismo

Fundamenta Informaticae

Rocznik

2021

Tom

Vol. 182, nr 2

Strony

111--179

Opis fizyczny

Bibliogr. 80 poz. rys., tab., wykr.

Twórcy

autor

Chelly Dagdia Zaineb

zaineb.chelly-dagdia@uvsq.fr

Université Paris-Saclay, UVSQ, DAVID

autor

Zarges Christine

c.zarges@aber.ac.uk

Department of Computer Science, Aberystwyth University, Aberystwyth, United Kingdom.

Bibliografia

[1] Pedrycz W, Skowron A, Kreinovich V. Handbook of granular computing. John Wiley & Sons, 2008. ISBN 978-0-470-03554-2.
[2] Yao Y, et al. Granular computing: basic issues and possible solutions. In: Proceedings of the 5th joint conference on information sciences, volume 1. 2000 pp. 186-189. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.7509.
[3] Lin TY, Yao YY, Zadeh LA. Data mining, rough sets and granular computing, volume 95. Physica, 2013. ISBN 978-3-7908-1791-1.
[4] Tsumoto S, Hirano S, Inuiguchi M. Workshop on Rough Set Theory and Granular Computing-Summary. In: Annual Conference of the Japanese Society for Artificial Intelligence. Springer, 2001 pp. 239-239. doi:https://doi.org/10.1007/3-540-45548-5_26.
[5] Zadeh LA. Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy sets and systems, 1997. 90(2):111-127. doi:https://doi.org/10.1016/S0165-0114(97)00077-8.
[6] Wang H, Xu Z, Pedrycz W. An overview on the roles of fuzzy set techniques in big data processing: Trends, challenges and opportunities. Knowledge-Based Systems, 2017. 118:15-30. doi:https://doi.org/10.1016/j.knosys.2016.11.008.
[7] Mukkamala RR, Hussain A, Vatrapu R. Fuzzy-set based sentiment analysis of big social data. In: 2014 IEEE 18th International Enterprise Distributed Object Computing Conference. IEEE, 2014 pp. 71-80. doi:10.1109/EDOC.2014.19.
[8] Zhang J, Li T, Pan Y. Parallel rough set based knowledge acquisition using MapReduce from big data. In: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. ACM, 2012 pp. 20-27. doi:https://doi.org/10.1145/2351316.2351320.
[9] Dagdia ZC, Zarges C, Schannes B, Micalef M, Galiana L, Rolland B, de Fresnoye O, Benchoufi M. Rough set theory as a data mining technique: A case study in epidemiology and cancer incidence prediction. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018 pp. 440-455. doi:https://doi.org/10.1007/978-3-030-10997-4_27.
[10] Hamidinekoo A, Dagdia ZC, Suhail Z, Zwiggelaar R. Distributed Rough Set Based Feature Selection Approach to Analyse Deep and Hand-crafted Features for Mammography Mass Classification. In: 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018 pp. 2423-2432. doi:10.1109/BigData. 2018.8621962.
[11] Dagdia ZC, Zarges C, Beck G, Lebbah M. A distributed rough set theory based algorithm for an efficient big data pre-processing under the spark framework. In: 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017 pp. 911-916. doi:10.1109/BigData.2017.8258008.
[12] Dagdia ZC, Zarges C, Beck G, Lebbah M. Nouveau Modèle de Sélection de Caractéristiques basé sur la Théorie des Ensembles Approximatifs pour les Données Massives: Méthode de sélection de caractéristiques pour les données massives. In: Conférence Internationalle sur l’Extraction et la Gestion des Connaissances. 2018 pp. 377-378. URL https://editions-rnti.fr/index.php?inprocid=1002409&lg=en.
[13] Düntsch I, Gediga G. Rough set data analysis. Encyclopedia of Computer Science and Technology, 2000. 43(28):281-301. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.8595&rep=rep1&type=pdf.
[14] Dagdia ZC, Zarges C, Beck G, Lebbah M. A Scalable and Effective Rough Set Theory based Approach for Big Data Pre-processing. Knowledge and Information Systems, 2020. 2020. doi:https://doi.org/10.1007/s10115-020-01467-y.
[15] Grzegorowski M, ´Sl˛ezak D. On resilient feature selection: Computational foundations of rC-reducts. Information Sciences, 2019. 499:25-44. URL https://doi.org/10.1016/j.ins.2019.05.041.
[16] Thangavel K, Pethalakshmi A. Dimensionality reduction based on rough set theory: A review. Applied Soft Computing, 2009. 9(1):1-12. URL https://doi.org/10.1016/j.asoc.2008.05.006.
[17] Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing. In: VLDB. 1999 pp. 518-529. URL https://dl.acm.org/doi/10.5555/645925.671516.
[18] Liu W, Wang J, Ji R, Jiang YG, Chang SF. Supervised hashing with kernels. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012 pp. 2074-2081. doi:10.1109/CVPR.2012.6247912.
[19] Weiss Y, Torralba A, Fergus R. Spectral hashing. In: Advances in neural information processing systems. 2009 pp. 1753-1760. URL http://papers.nips.cc/paper/3383-spectral-hashing.pdf.
[20] Cai D. A revisit of hashing algorithms for approximate nearest neighbor search. arXiv preprint arXiv:1612.07545, 2016. URL https://arxiv.org/abs/1612.07545.
[21] Dagdia ZC, Zarges C, Beck G, Azzag H, Lebbah M. A Distributed Rough Set Theory Algorithm based on Locality Sensitive Hashing for an Efficient Big Data Pre-processing. In: 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018 pp. 2597-2606. doi:10.1109/BigData.2018.8622024.
[22] Liu H, Motoda H, Setiono R, Zhao Z. Feature selection: An ever evolving frontier in data mining. In: Feature Selection in Data Mining. 2010 pp. 4-13. URL http://proceedings.mlr.press/v10/.
[23] Liu H, Zhao Z. Manipulating data and dimension reduction methods: feature selection. In: Computational Complexity, pp. 1790-1800. Springer, 2012. doi:10.1007/978-0-387-30440-3_317.
[24] Bolón-Canedo V, Rego-Fernández D, Peteiro-Barral D, Alonso-Betanzos A, Guijarro-Berdiñas B, Sánchez-Maroño N. On the scalability of feature selection methods on high-dimensional data. Knowledge and Information Systems, 2018. pp. 1-48. doi:https://doi.org/10.1007/s10115-017-1140-3.
[25] Dean J, Ghemawat S. MapReduce: a flexible data processing tool. Communications of the ACM, 2010. 53(1):72-77. doi:10.1145/1629175.1629198.
[26] Vinh NX, Chan J, Romano S, Bailey J, Leckie C, Ramamohanarao K, Pei J. Discovering outlying aspects in large datasets. Data Mining and Knowledge Discovery, 2016. 30(6):1520-1555. doi:10.1007/s10618-016-0453-2.
[27] Zhang J, Wang S, Chen L, Gallinari P. Multiple Bayesian discriminant functions for high-dimensional massive data classification. Data Mining and Knowledge Discovery, 2017. 31(2):465-501. doi:10.1007/s10618-016-0481-y.
[28] Zhai T, Gao Y, Wang H, Cao L. Classification of high-dimensional evolving data streams via a resource-efficient online ensemble. Data Mining and Knowledge Discovery, 2017. pp. 1-24. doi:10.1007/s10618-017-0500-7.
[29] Shanahan JG, Dai L. Large scale distributed data science using apache spark. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015 pp. 2323-2324. doi:10.1145/2783258.2789993.
[30] Peralta D, del Río S, Ramírez-Gallego S, Triguero I, Benitez JM, Herrera F. Evolutionary feature selection for big data classification: A mapreduce approach. Mathematical Problems in Engineering, 2015. 2015.URL https://doi.org/10.1155/2015/246139.
[31] Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence, 2005. 27(8):1226-1238. doi:10.1109/TPAMI.2005.159.
[32] Datar M, Immorlica N, Indyk P, Mirrokni VS. Locality-sensitive Hashing Scheme Based on P-stable Distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG ’04. ACM, New York, NY, USA. ISBN 1-58113-885-7, 2004 pp. 253-262. doi:10.1145/997817.997857.
[33] Charikar MS. Similarity Estimation Techniques from Rounding Algorithms. In: Proceedings of the Thiry fourth Annual ACM Symposium on Theory of Computing, STOC ’02. ACM, New York, NY, USA. ISBN 1-58113-495-9, 2002 pp. 380-388. doi:10.1145/509907.509965.
[34] Indyk P, Motwani R. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98. ACM, New York, NY, USA. ISBN 0-89791-962-9, 1998 pp. 604-613. doi:10.1145/276698.276876.
[35] Broder A. On the Resemblance and Containment of Documents. In: Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES ’97. IEEE Computer Society, Washington, DC, USA. ISBN 0-8186-8132-2, 1997 pp. 21-29.
[36] Weiss Y, Torralba A, Fergus R. Spectral Hashing. In: Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS’08. Curran Associates Inc., USA. ISBN 978-1-6056-0-949-2, 2008 pp. 1753-1760.
[37] He J, Liu W, Chang SF. Scalable Similarity Search with Optimized Kernel Hashing. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10. ACM, New York, NY, USA. ISBN 978-1-4503-0055-1, 2010 pp. 1129-1138. doi:10.1145/1835804.1835946.
[38] He J, Radhakrishnan R, Chang SF, Bauer C. Compact Hashing with Joint Optimization of Search Accuracy and Time. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’11. IEEE Computer Society, Washington, DC, USA. ISBN 978-1-4577-0394-2, 2011 pp. 753-760. doi:10.1109/CVPR.2011.5995518.
[39] Wang J, Shen HT, Song J, Ji J. Hashing for Similarity Search: A Survey. CoRR, 2014. abs/1408.2927.1408.2927.
[40] Pawlak Z, Skowron A. Rudiments of rough sets. Information sciences, 2007. 177(1):3-27. doi:10.1016/j.ins.2006.06.003.
[41] Sakr S, Liu A, Batista DM, Alomari M. A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials, 2011. 13(3):311-336. doi:10.1109/SURV.2011.032211.00087.
[42] Snir M. MPI-the Complete Reference: the MPI core, volume 1. MIT press, 1998. ISBN 978-0-262-69215-1.
[43] Fernández A, del Río S, López V, Bawakid A, del Jesus MJ, Benítez JM, Herrera F. Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2014. 4(5):380-409. doi: 10.1002/widm.1134.
[44] https://sparkapacheorg/mllib/. MLlib Website.
[45] Xu X, Jäger J, Kriegel HP. A fast parallel clustering algorithm for large spatial databases. In: High Performance Data Mining, pp. 263-290. Springer, 1999. doi:10.1023/A:1009884809343.
[46] Dua D, Graff C. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.
[47] Labrinidis A, Jagadish HV. Challenges and opportunities with big data. Proceedings of the VLDB Endowment, 2012. 5(12):2032-2033. doi:10.14778/2367502.2367572. URL https://www.vldb.org/pvldb/vol5.html.
[48] Fan W, Bifet A. Mining big data: current status, and forecast to the future. ACM sIGKDD Explorations Newsletter, 2013. 14(2):1-5. doi:10.1145/2481244.2481246.
[49] Afendi FM, Ono N, Nakamura Y, Nakamura K, Darusman LK, Kibinge N, Morita AH, Tanaka K, Horai H, Altaf-Ul-Amin M, et al. Data mining methods for omics and knowledge of crude medicinal plants toward big data biology. Computational and Structural Biotechnology Journal, 2013. 4(5):1-14. doi:10.5936/csbj.201301010.
[50] Wu X, Zhu X, Wu GQ, Ding W. Data mining with big data. IEEE transactions on knowledge and data engineering, 2014. 26(1):97-107. doi:10.1109/TKDE.2013.109.
[51] Chen M, Mao S, Liu Y. Big data: A survey. Mobile Networks and Applications, 2014. 19(2):171-209. doi:10.1007/s11036-013-0489-0.
[52] Larose DT. Discovering knowledge in data: an introduction to data mining. John Wiley & Sons, 2014. ISBN 978-0-470-90874-7.
[53] Grzymala-Busse JW, Ziarko W. Data Mining and Rough Set Theory. Commun. ACM, 2000. 43(4):108-109. doi:10.1145/332051.332082.
[54] Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on knowledge and data engineering, 2005. 17(4):491-502. doi:10.1109/TKDE.2005.66.
[55] Talukder N, Zaki MJ. A distributed approach for graph mining in massive networks. Data Mining and Knowledge Discovery, 2016. 30(5):1024-1052. doi:10.1007/s10618-016-0466-x.
[56] Schäfer P. Scalable time series classification. Data Mining and Knowledge Discovery, 2016. 30(5):1273-1298. doi:10.1007/s10618-015-0441-y.
57] Qian Y, Liang J, Pedrycz W, Dang C. Positive approximation: An accelerator for attribute reduction in rough set theory. Artificial Intelligence, 2010. 174(9):597 - 618. doi:10.1016/j.artint.2010.04.018.
[58] Schneider J, Vlachos M. Scalable density-based clustering with quality guarantees using random projections. Data Mining and Knowledge Discovery, 2017. pp. 1-34. doi:10.1007/s10618-017-0498-x.
[59] White T. Hadoop: The definitive guide. " O’Reilly Media, Inc.", 2012. ISBN 978-1-449-31152-0.
[60] Tan M, Tsang IW, Wang L. Towards ultrahigh dimensional feature selection for big data. Journal of Machine Learning Research, 2014. 15:1371-1429. URL http://jmlr.org/papers/v15/.
[61] Guyon I, Elisseeff A. An introduction to variable and feature selection. Journal of machine learning research, 2003. 3(Mar):1157-1182. URL http://www.jmlr.org/papers/v3/.
[62] Dash M, Liu H. Feature selection for classification. Intelligent data analysis, 1997. 1(1-4):131-156. doi:10.1016/S1088-467X(97)00008-5.
[63] Keogh E, Mueen A. Curse of dimensionality. In: Encyclopedia of Machine Learning, pp. 257-258. Springer, 2011. doi:10.1007/978-1-4899-7687-1_192.
[64] Pawlak Z. Rough sets: Theoretical aspects of reasoning about data, volume 9. Springer Science & Business Media, 2012. ISBN 978-94-011-3534-4.
[65] Wang X, Yang J, Teng X, Xia W, Jensen R. Feature selection based on rough sets and particle swarm optimization. Pattern recognition letters, 2007. 28(4):459-471. doi:10.1016/j.patrec.2006.09.003.
[66] Swiniarski RW, Skowron A. Rough set methods in feature selection and recognition. Pattern recognition letters, 2003. 24(6):833-849. doi:10.1016/S0167-8655(02)00196-4.
[67] Guyon I, Elisseeff A. An introduction to feature extraction. Feature extraction, 2006. pp. 1-25. doi: 10.1007/978-3-540-35488-8_1.
[68] John GH, Kohavi R, Pfleger K, et al. Irrelevant features and the subset selection problem. In: Machine learning: proceedings of the eleventh international conference. 1994 pp. 121-129. doi:10.1016/B978-1-55860-335-6.50023-4.
[69] Prinzie A, Van den Poel D. Random forests for multiclass classification: Random multinomial logit. Expert systems with Applications, 2008. 34(3):1721-1732. doi:10.1016/j.eswa.2007.01.029.
[70] Ahmed S, Zhang M, Peng L. Enhanced feature selection for biomarker discovery in LC-MS data using GP. In: Evolutionary Computation (CEC), 2013 IEEE Congress on. IEEE, 2013 pp. 584-591. doi:10.1109/CEC.2013.6557621.
[71] Aghdam MH, Ghasem-Aghaee N, Basiri ME. Text feature selection using ant colony optimization. Expert systems with applications, 2009. 36(3):6843-6853. doi:10.1016/j.eswa.2008.08.022.
[72] Ghosh A, Datta A, Ghosh S. Self-adaptive differential evolution for feature selection in hyperspectral image data. Applied Soft Computing, 2013. 13(4):1969-1977. doi:10.1016/j.asoc.2012.11.042. [73] Lingras P. Unsupervised rough set classification using GAs. Journal of Intelligent Information Systems, 2001. 16(3):215-228. doi:10.1023/A:1011219918340.
[74] Lingras P. Rough set clustering for web mining. In: Fuzzy Systems, 2002. FUZZ-IEEE’02. Proceedings of the 2002 IEEE International Conference on, volume 2. IEEE, 2002 pp. 1039-1044. doi:10.1109/FUZZ.2002.1006647.
[75] Bai C, Sarkis J. Integrating sustainability into supplier selection with grey system and rough set methodologies. International Journal of Production Economics, 2010. 124(1):252-264. doi:10.1016/j.ijpe.2009.11.023.
[76] Guller M. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis. Springer, 2015. ISBN 978-1-4842-0965-3. doi:10.1007/978-1-4842-0964-6.
[77] Indyk P, Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998 pp. 604-613. doi:10.1145/276698.276876.
[78] Charikar MS. Similarity estimation techniques from rounding algorithms. In: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing. ACM, 2002 pp. 380-388. doi:doi.org/10.1145/509907.509965.
[79] Lv Q, Josephson W, Wang Z, Charikar M, Li K. Ferret: a toolkit for content-based similarity search of feature-rich data. ACM SIGOPS Operating Systems Review, 2006. 40(4):317-330. doi:10.1145/1218063.1217966.
[80] Wang J, Shen HT, Song J, Ji J. Hashing for Similarity Search: A Survey. CoRR, 2014. abs/1408.2927.1408.2927, URL http://arxiv.org/abs/1408.2927.

Uwagi

Opracowanie rekordu ze środków MEiN, umowa nr SONP/SP/546092/2022 w ramach programu "Społeczna odpowiedzialność nauki" - moduł: Popularyzacja nauki i promocja sportu (2022-2023). (PL)

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-1754a616-327d-4533-8ca5-940230004c41