An Improved Cop-Kmeans Clustering for Solving Constraint Violation Based on MapReduce Framework

Yang, Y.; Rutayisire, T.; Lin, C.; Li, C.; Teng, F.

Article details

Selected full texts from this journal

https://fi.episciences.org/

Abstract

Clustering with pairwise constraints has recently received much attention in the clustering community. In particular, must-link and cannot-link constraints between a given pair of instances in the data set are common prior knowledge incorporated into many clustering algorithms today. This approach has been shown to be successful in guiding a number of well-known clustering algorithms towards more accurate results. However, recent work has also shown that incorporating must-link and cannot-link constraints makes clustering algorithms overly sensitive to “the assignment order of instances” and therefore results in constraint violation. The contributions of this paper are twofold. The first is to address the issue of constraint violation in Cop-Kmeans by emphasizing a sequenced assignment of cannot-link instances after conducting a Breadth-First Search of the cannot-link set. The second is to reduce the computational complexity of Cop-Kmeans for massive data sets by adopting a MapReduce framework. Experimental results show that our approach performs well on massive data sets and may overcome the problem of constraint violation.
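The idea of a sequenced assignment can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`bfs_order`, `assign`), the choice to start the search from the most constrained instance, and the toy data are all assumptions made for illustration. Must-link constraints, centroid updates, and the MapReduce parallelization are omitted. The sketch visits instances in Breadth-First Search order over the cannot-link graph and gives each one the nearest centroid not already taken by a cannot-link neighbour.

```python
from collections import deque


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def build_adjacency(n, cannot_link):
    """Adjacency lists of the cannot-link graph over instances 0..n-1."""
    adj = {i: [] for i in range(n)}
    for a, b in cannot_link:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def bfs_order(n, cannot_link):
    """Visit order in which cannot-link neighbours are handled together.

    Assumption: each search starts from the unvisited instance with the
    most cannot-link constraints; unconstrained instances come last.
    """
    adj = build_adjacency(n, cannot_link)
    starts = sorted(adj, key=lambda i: -len(adj[i]))
    seen, order = set(), []
    for s in starts:
        if s in seen:
            continue
        seen.add(s)
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return order


def assign(points, centroids, cannot_link, order):
    """Assign each instance to its nearest centroid that no already
    assigned cannot-link neighbour occupies; raise if none is left."""
    adj = build_adjacency(len(points), cannot_link)
    labels = {}
    for i in order:
        blocked = {labels[j] for j in adj[i] if j in labels}
        feasible = [c for c in range(len(centroids)) if c not in blocked]
        if not feasible:
            raise ValueError(f"no feasible cluster for instance {i}")
        labels[i] = min(feasible, key=lambda c: sq_dist(points[i], centroids[c]))
    return labels


# Toy data (hypothetical): three mutually cannot-linked instances and one free instance.
points = [(0.0,), (0.1,), (0.2,), (5.0,)]
centroids = [(0.0,), (1.0,), (5.0,)]
cannot_link = [(0, 1), (1, 2), (0, 2)]

order = bfs_order(len(points), cannot_link)
labels = assign(points, centroids, cannot_link, order)
print(order, labels)
```

In this sketch the per-instance distance computations are independent of one another, which is the kind of work a map phase could distribute; how the paper actually splits the work between map and reduce tasks is not stated in the abstract.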

Keywords

constraint violation, semi-supervised clustering

Publisher

IOS Press

Journal

Fundamenta Informaticae

Year

2013

Volume

Vol. 126, no. 4

Pages

301-318

Physical description

Bibliography: 31 items, tables, charts.

Contributors

Author

Yang Y.

yyang@swjtu.edu.cn

School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, P.R. China

Author

Rutayisire T.

rutantonio14@yahoo.com

School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, P.R. China

Author

Lin C.

linchao0916@126.com

School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, P.R. China

Author

Li C.

trli@swjtu.edu.cn

School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, P.R. China

Author

Teng F.

fteng@swjtu.edu.cn

School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, P.R. China

Bibliography

[1] Davidson, I., Basu, S.: A survey of clustering with instance level constraints, ACM Transactions on Knowledge Discovery from Data, 2007, 1-41.
[2] Wang G., Wang Y.: 3DM: Domain-oriented Data-driven Data Mining, Fundamenta Informaticae, 90, 2009, 395-426.
[3] Yu J., Yang M., Hao P.: A Novel Multimodal Probability Model for Cluster Analysis, Fundamenta Informaticae, 111, 2011, 81-90.
[4] Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained K-means clustering with background knowledge, Proc. International Conference on Machine Learning, 2001, 577-584.
[5] Basu S., Banerjee A., Mooney R.J.: Semi-Supervised clustering by seeding, Proc. 19th International Conference on Machine Learning, 2002, 19-26.
[6] Xing E.P., Ng A.Y., Jordan M.I., et al.: Distance metric learning, with application to clustering with side-information, Advances in Neural Information Processing Systems, 2003, 505-512.
[7] Saha S., Bandyopadhyay S.: Semi-GAPS: A Semi-supervised Clustering Method, Fundamenta Informaticae, 96, 2009, 195-209.
[8] Wagstaff, K., Cardie, C.: Clustering with instance level constraints, Proc. International Conference on Machine Learning, 2000, 1103-1110.
[9] Davidson, I., Ravi, S.S.: Clustering with constraints: Feasibility issues and the k-means algorithm, Proc. SIAM International Conference on Data Mining, 2005, 138-149.
[10] Sun Y., Liu M., Wu C.: A Modified K-means Algorithm for Clustering Problem with Balancing Constraints, Proc. 3rd International Conference on Measuring Technology and Mechatronics Automation, 2011, 127-130.
[11] Schmidt J., Brandle E. M., Kramer S.: Clustering with Attribute-Level Constraints, Proc. 11th IEEE International Conference on Data Mining, 2011, 1206-1211.
[12] Basu, S., Banerjee, A., Mooney, R.J.: Active semi-supervision for pairwise constrained clustering, Proc. SIAM International Conference on Data Mining, 2004, 333-344.
[13] Wagstaff, K.: Intelligent clustering with instance-level constraints, Cornell University, 2002.
[14] Tan, W., Yang, Y., Li, T.: An improved COP-KMeans algorithm for solving constraint violation, Proc. International FLINS Conference on Foundations and Applications of Computational Intelligence, 2010, 690-696.
[15] Anthony, K., Han, J., Raymond, T.: Constraint-Based Clustering in Large Databases, Proc. International Conference on Database Theory, 2001, 405-419.
[16] Rutayisire T., Yang Y., Lin C., Zhang J.: A Modified Cop-Kmeans Algorithm Based on Sequenced Cannot-Link Set, Proc. 6th International Conference on Rough Sets and Knowledge Technology (RSKT2011), LNAI 6954, Springer, 2011, 217-225.
[17] Dean J., Ghemawat S.: MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, 51(1), 2008, 107-113.
[18] Xuan, W.: Clustering in the Cloud: Clustering Algorithm Adoption to Hadoop Map/Reduce Framework, Technical Reports-Computer Science 19, 2010.
[19] Yang Y., Chen Z.: Parallelized Computing of Attribute Core Based on Rough Set Theory and MapReduce, Proc. 7th International Conference on Rough Sets and Knowledge Technology (RSKT2012), LNAI 7414, Springer, 2012, 167-172.
[20] Malay, K.: Clustering Large Databases in Distributed Environment, Proc. International Advanced Computing Conference, 2009, 351-358.
[21] Zhang, Y., Xiong, Z., Mao, J.: The Study of Parallel K-Means Algorithm, Proc. World Congress on Intelligent Control and Automation, 2006, 5868-5871.
[22] Zhao, W., Ma, H., He, Q.: Parallel K-Means Clustering Based on MapReduce, Proc. CloudCom, LNCS 5931, 2009, 674-679.
[23] Li H., Wu G., et al.: K-Means Clustering with Bagging and MapReduce, Proc. 44th Hawaii International Conference on System Sciences, 2011.
[24] Lin C., Yang Y., Rutayisire T.: A Parallel Cop-Kmeans Clustering Algorithm Based on MapReduce Framework, Proc. 6th International Conference on Intelligent Systems & Knowledge Engineering, 2011, 93-102.
[25] Haichao, H., Yong, C., Ruilian, Z.: A semi-supervised clustering algorithm based on must-link set, Proc. International Conference on Advanced Data Mining & Applications, 2008, 492-499.
[26] Davidson, I., Ravi, S.S.: Identifying and Generating easy sets of constraints for clustering, Proc. American Association for Artificial Intelligence, 2006, 336-341.
[27] West, D.B.: Introduction to Graph Theory, Prentice Hall, Inc., Englewood Cliffs, NJ, 2001.
[28] White T.: Hadoop: The Definitive Guide, O’Reilly Media, Inc., Second Edition, 2011.
[29] Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html
[30] Yang Y., Kamel M.: An Aggregated Clustering Approach Using Multi-Ant Colonies Algorithms. Pattern Recognition, 39(7), 2006, 1278-1289.
[31] Xu, X., Jager, J., Kriegel, H.P.: A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery, 3, 1999, 263-290.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-073558d7-94f2-4c81-9c3c-0909e2fad63d