A step towards the majority-based clustering validation decision fusion method

Panskyi, Taras; Mosorov, Volodymyr

doi:10.35784/iapgos.2596

Title variants

Krok w kierunku metody fuzji decyzji opartej na większości dla walidacji wyników klasteryzacji

Abstracts

Clustering validation indices (CVIs) are used to validate the results of cluster analysis and to determine which clustering algorithm performs best. Different validation indices may suit different clustering algorithms or partition dissimilarity measures; however, which index is most suitable in practice remains unknown. A single CVI generally cannot handle the wide variability and scalability of the data or cope successfully with every context. One popular approach is therefore to combine multiple CVIs and fuse their votes into a final decision. This work analyzes the majority-based decision fusion method. The experimental work consisted of designing and implementing the NbClust majority-based decision fusion method and then evaluating the performance of the CVIs with different clustering algorithms and dissimilarity measures to find the best validation configuration. In addition, the authors propose enhancing the standard majority-based decision fusion method with straightforward rules to maximize the efficiency of the validation procedure. The results show that the enhanced method with an invasive validation configuration could cope with almost all data sets (99%) across different experimental factors (density, dimensionality, number of clusters, etc.).
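
The basic idea of majority-based fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the NbClust API; the function name `majority_vote`, the tie-breaking rule (prefer the smaller number of clusters), and the example votes are all assumptions made for illustration. The enhanced rules proposed in the paper are not reproduced here.

```python
from collections import Counter


def majority_vote(votes):
    """Fuse per-index votes for the number of clusters by simple majority.

    `votes` maps a CVI name to the number of clusters that index prefers.
    Returns (winning_k, support), where support is the number of indices
    that voted for winning_k. Ties are broken toward the smaller k, which
    is an assumption of this sketch, not a rule taken from the paper.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes.values())
    best_support = max(counts.values())
    winning_k = min(k for k, c in counts.items() if c == best_support)
    return winning_k, best_support


# Hypothetical votes: the index names are real CVIs, the numbers are made up.
example_votes = {
    "silhouette": 3,
    "calinski_harabasz": 3,
    "davies_bouldin": 4,
    "dunn": 3,
    "krzanowski_lai": 2,
}
winning_k, support = majority_vote(example_votes)
print(f"k = {winning_k} with {support} of {len(example_votes)} votes")
# prints "k = 3 with 3 of 5 votes"
```

Reporting the support alongside the winning value makes it easy to see how strong the majority is; a weak majority is presumably the kind of case where additional decision rules become useful.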

Keywords

clustering; clustering validation index; decision fusion method

Publisher

Wydawnictwo Politechniki Lubelskiej

Journal

Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska

Year

2021

Volume

Vol. 11, no. 2

Pages

4–13

Physical description

Bibliography: 72 items, charts.

Contributors

author

Panskyi Taras

tpanski@kis.p.lodz.pl

Lodz University of Technology, Institute of Applied Computer Science, Lodz, Poland

https://orcid.org/0000-0002-0416-8711

author

Mosorov Volodymyr

w.mosorow@kis.p.lodz.pl

Lodz University of Technology, Institute of Applied Computer Science, Lodz, Poland

https://orcid.org/0000-0001-6016-8671

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-a792594f-7072-4593-ac24-19c40e0823e3