Exploiting multi-core and many-core parallelism for subspace clustering

Datta, Amitava; Kaur, Amardeep; Lauer, Tobias; Chabbouh, Sami

doi:10.2478/amcs-2019-0006

Article details

Content

Full texts:

06_datta_kaur_lauer_exploiting_multi_core_and_many_2019_1.pdf

Abstract

Finding clusters in high-dimensional data is a challenging research problem. Subspace clustering algorithms aim to find clusters in all possible subspaces of the dataset, where a subspace is a subset of the dimensions of the data. However, the number of subspaces grows exponentially with the dimensionality of the data, which renders most of these algorithms inefficient as well as ineffective. Moreover, these algorithms have ingrained data dependencies in the clustering process, which make parallelization difficult and inefficient. SUBSCALE is a recent subspace clustering algorithm that scales with the number of dimensions and contains independent processing steps that can be exploited through parallelism. In this paper, we aim to leverage the computational power of widely available multi-core processors to improve the runtime performance of the SUBSCALE algorithm. The experimental evaluation shows linear speedup. Moreover, we develop an approach using graphics processing units (GPUs) for fine-grained data parallelism to accelerate the computation further. First tests of the GPU implementation show very promising results.
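
The exponential growth mentioned in the abstract can be made concrete with a minimal sketch. It is not the SUBSCALE algorithm; it only assumes that a subspace is any non-empty subset of the d dimensions, so there are 2^d - 1 of them. The function names `count_subspaces` and `enumerate_subspaces` are hypothetical and chosen for illustration.

```python
from itertools import combinations


def count_subspaces(d: int) -> int:
    """Number of non-empty subsets of d dimensions (2^d - 1)."""
    return 2 ** d - 1


def enumerate_subspaces(d: int):
    """Yield every non-empty subspace as a tuple of dimension indices.

    Subspaces are produced in order of increasing size; brute-force
    enumeration like this is only feasible for small d.
    """
    for k in range(1, d + 1):
        yield from combinations(range(d), k)


if __name__ == "__main__":
    # The count roughly doubles with every added dimension.
    for d in (3, 10, 20, 50):
        print(f"d={d}: {count_subspaces(d)} subspaces")

    # For d=3 the subspaces can still be listed explicitly.
    print(list(enumerate_subspaces(3)))
```

Even at 50 dimensions the count exceeds 10^15, which illustrates why exhaustive search over subspaces does not scale and why independent processing steps that can run in parallel are attractive.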

Keywords

data mining, subspace clustering, multi-core processor, many-core processor, GPU computing

Publisher

Oficyna Wydawnicza Uniwersytetu Zielonogórskiego

Journal

International Journal of Applied Mathematics and Computer Science

Year

2019

Volume

Vol. 29, no. 1

Pages

81–91

Physical description

Bibliography: 43 items, figures, tables, charts.

Contributors

author

Datta Amitava

School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, Perth, WA 6009, Australia

author

Kaur Amardeep

School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, Perth, WA 6009, Australia

author

Lauer Tobias

tobias.lauer@hs-offenburg.de

Department of Electrical Engineering and Information Technology, Offenburg University of Applied Sciences, Badstr. 24, 77652 Offenburg, Germany

author

Chabbouh Sami

Department of Electrical Engineering and Information Technology, Offenburg University of Applied Sciences, Badstr. 24, 77652 Offenburg, Germany

Bibliography

[1] Aggarwal, C.C. and Reddy, C.K. (2013). Data Clustering: Algorithms and Applications, 1st Edn., Chapman & Hall/CRC.
[2] Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C. and Park, J.S. (1999). Fast algorithms for projected clustering, SIGMOD Record 28(2): 61–72.
[3] Agrawal, R., Gehrke, J., Gunopulos, D. and Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications, ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA, Vol. 27, pp. 94–105.
[4] Alcantara, D.A.F. (2011). Efficient Hash Tables on the GPU, PhD thesis, University of California Davis, Davis, CA.
[5] Anderson, S.E. (2018). Bit Twiddling Hacks: Compute the lexicographically next bit permutation, http://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation.
[6] Berkhin, P. (2006). A survey of clustering data mining techniques, in J. Kogan et al. (Eds.), Grouping Multidimensional Data, Springer, Berlin/Heidelberg, pp. 25–71.
[7] Cheng, C.-H., Fu, A.W. and Zhang, Y. (1999). Entropy-based subspace clustering for mining numerical data, 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 84–93.
[8] Dagum, L. and Menon, R. (1998). OpenMP: An industry standard API for shared-memory programming, IEEE Computational Science Engineering 5(1): 46–55.
[9] Datta, A., Kaur, A., Lauer, T. and Chabbouh, S. (2017). Parallel subspace clustering using multi-core and many-core architectures, in M. Kirikova et al. (Eds.), New Trends in Databases and Information Systems, Springer International Publishing, Cham, pp. 213–223.
[10] Elhamifar, E. and Vidal, R. (2013). Sparse subspace clustering: Algorithm, theory, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11): 2765–2781.
[11] Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise, International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, pp. 226–231.
[12] Fan, J., Han, F. and Liu, H. (2014). Challenges of big data analysis, National Science Review 1(2): 293–314.
[13] Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press, San Diego, CA.
[14] Geiger, A., Lenz, P., Stiller, C. and Urtasun, R. (2013). Vision meets robotics: The KITTI dataset, The International Journal of Robotics Research 32(11): 1231–1237.
[15] Google Scholar (2018). Search for ‘data clustering’, https://scholar.google.com/scholar?q=data+clustering&btnG=.
[16] Han, J., Kamber, M. and Pei, J. (2011). Data Mining: Concepts and Techniques, 3rd Edn., Morgan Kaufmann Publishers, San Francisco, CA.
[17] Harris, M., Sengupta, S. and Owens, J.D. (2007). Parallel prefix sum (scan) with CUDA, GPU Gems 3(39): 851–876.
[18] Jain, A.K. and Dubes, R.C. (1988). Algorithms for Clustering Data, Prentice-Hall, Inc., Upper Saddle River, NJ.
[19] Jain, A.K., Murty, M.N. and Flynn, P.J. (1999). Data clustering: A review, ACM Computing Surveys 31(3): 264–323.
[20] Jolliffe, I.T. (2002). Principal Component Analysis, 2nd Edn., Springer, New York, NY.
[21] Jun, J., Chung, S. and McLeod, D. (2006). Subspace clustering of microarray data based on domain transformation, VLDB Workshop on Data Mining and Bioinformatics, Seoul, Korea, pp. 14–28.
[22] Kailing, K., Kriegel, H.-P. and Kröger, P. (2004). Density-connected subspace clustering for high-dimensional data, SIAM International Conference on Data Mining, Lake Buena Vista, FL, USA, Vol. 4, pp. 246–256.
[23] Kaur, A. and Datta, A. (2014). Subscale: Fast and scalable subspace clustering for high dimensional data, IEEE International Conference on Data Mining Workshop, Shenzhen, China, pp. 621–628.
[24] Kaur, A. and Datta, A. (2015). A novel algorithm for fast and scalable subspace clustering of high-dimensional data, Journal of Big Data 2(1): 1–24.
[25] Kriegel, H.-P., Kröger, P. and Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Transactions on Knowledge Discovery from Data 3(1): 1–58.
[26] Li, T., Ma, S. and Ogihara, M. (2004). Document clustering via adaptive subspace iteration, 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, pp. 218–225.
[27] Lichman, M. (2013). UCI machine learning repository, http://archive.ics.uci.edu/ml.
[28] Loughry, J., van Hemert, J. and Schoofs, L. (2000). Efficiently enumerating the subsets of a set, http://www.applied-math.org/subset.pdf.
[29] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations, 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, Vol. 1, pp. 281–297.
[30] McCaffrey, J. (2004). Generating the MTH lexicographical element of a mathematical combination, MSDN Library, Microsoft, Redmond, WA.
[31] Murtagh, F. (1983). A survey of recent advances in hierarchical clustering algorithms, The Computer Journal 26(4): 354–359.
[32] Nagesh, H., Goil, S. and Choudhary, A. (2001). Adaptive grids for clustering massive data sets, 1st SIAM International Conference on Data Mining, Chicago, IL, USA, pp. 1–17.
[33] Nvidia CUDA (2018). CUDA parallel computing platform and programming model, http://www.nvidia.com/object/cuda_home_new.html.
[34] Parsons, L., Haque, E. and Liu, H. (2004). Subspace clustering for high dimensional data: A review, ACM SIGKDD Explorations Newsletter 6(1): 90–105.
[35] Sim, K., Gopalkrishnan, V., Zimek, A. and Cong, G. (2013). A survey on enhanced subspace clustering, Data Mining and Knowledge Discovery 26(2): 332–397.
[36] Steinbach, M., Ertöz, L. and Kumar, V. (2004). The challenges of clustering high dimensional data, in L.T. Wille (Ed.), New Directions in Statistical Physics, Springer, Berlin/Heidelberg, pp. 273–309.
[37] Strohm, P.T., Wittmer, S., Haberstroh, A. and Lauer, T. (2015). GPU-accelerated quantification filters for analytical queries in multidimensional databases, in N. Bassiliades et al. (Eds.), New Trends in Databases and Information Systems II, Springer, Cham, pp. 229–242.
[38] Thalamuthu, A., Mukhopadhyay, I., Zheng, X. and Tseng, G.C. (2006). Evaluation and comparison of gene clustering methods in microarray analysis, Bioinformatics 22(19): 2405–2412.
[39] Tierney, S., Gao, J. and Guo, Y. (2014). Subspace clustering for sequential data, IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, pp. 1019–1026.
[40] Xu, D. and Tian, Y. (2015). A comprehensive survey of clustering algorithms, Annals of Data Science 2(2): 165–193.
[41] Xu, R. and Wunsch, D. (2005). Survey of clustering algorithms, IEEE Transactions on Neural Networks 16(3): 645–678.
[42] Zhu, B., Mara, A. and Mozo, A. (2015). CLUS: Parallel subspace clustering algorithm on spark, in T. Morzy et al. (Eds.), New Trends in Databases and Information Systems, Communications in Computer and Information Science, Vol. 539, Springer International Publishing, Cham, pp. 175–185.
[43] Zhu, J., Liao, S., Lei, Z., Yi, D. and Li, S.Z. (2013). Pedestrian attribute classification in surveillance: Database and evaluation, ICCV Workshop on Large-Scale Video Search and Mining (LSVSM’13), Sydney, Australia, pp. 331–338.

Notes

Record prepared under agreement 509/P-DUN/2018 with funds from the Ministry of Science and Higher Education (MNiSW) allocated to science dissemination activities (2019).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-68d9047c-4971-41cc-9c95-8069b4807b04