Article title

On the Consistency of k-means++ algorithm

Identifiers
Title variants
Publication languages
EN
Abstracts
EN
In this paper we prove that the expected value of the objective function of the k-means++ algorithm on a sample converges to the population expected value. Since k-means++ provides a constant-factor approximation of the k-means objective for samples, such an approximation can also be achieved for the population as the sample size increases. This result is of potential practical relevance when considering subsampling for clustering large data sets (large databases).
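The abstract refers to the seeding step of k-means++ and its objective function. A minimal sketch of both is given below; the function names and the NumPy-based implementation are illustrative and not taken from the paper.

```python
import numpy as np

def kmeans_pp_seeding(X, k, seed=None):
    """k-means++ seeding (Arthur & Vassilvitskii, 2007): pick the first
    center uniformly at random, then pick each subsequent center with
    probability proportional to D(x)^2, the squared distance of x to
    the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # D^2-weighted sampling of the next center
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

def kmeans_objective(X, centers):
    """k-means cost: sum of squared distances to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()
```

The consistency result concerns the behaviour of this objective when the seeding is run on a subsample rather than the full data set: as the sample grows, the expected sample objective approaches the population one.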
Publisher
Year
Pages
361--377
Physical description
Bibliography: 12 items, figures.
Authors
  • Institute of Computer Science of the Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
Bibliography
  • [1] MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proc. Fifth Berkeley Symp. on Math. Statist. and Prob., volume 1. Univ. of Calif. Press, 1967 pp. 281-297. URL https://projecteuclid.org/euclid.bsmsp/1200512992.
  • [2] Wierzchoń S, Kłopotek M. Modern Clustering Algorithms. Springer Verlag Series: Studies in Big Data 34. Springer Verlag, 2018.
  • [3] Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding. In: Bansal N, Pruhs K, Stein C (eds.), Proc. of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007. SIAM, New Orleans, Louisiana, USA, 2007 pp. 1027-1035. ISBN: 978-0-898716-24-5.
  • [4] Bejarano J, Bose K, Brannan T, Thomas A, Adragni K, Neerchal NK. Sampling Within k-Means Algorithm to Cluster Large Datasets. Technical Report HPCF-2011-12, High Performance Computing Facility, University of Maryland, Baltimore County, 2011. doi:10.2172/1025410.
  • [5] Pollard D. Strong Consistency of K-Means Clustering. Ann. Statist., 1981. 9(1):135-140. URL https://www.jstor.org/stable/2240876.
  • [6] Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. A Local Search Approximation Algorithm for k-Means Clustering. In: 18th Annual ACM Symposium on Computational Geometry (SoCG’02), June 2002, Barcelona, Spain. 2002 pp. 10-18. doi:10.1145/513400.513402.
  • [7] Tang C, Monteleoni C. Scalable constant k-means approximation via heuristics on well-clusterable data. In: Poster Session of Learning faster from easy data II conference, Montreal, Canada.
  • [8] Kumar A, Sabharwal Y, Sen S. A simple linear time approximation algorithm for geometric k-means problem in any dimension. In: FOCS 2004. 2004 pp. 454-462. doi:10.1109/FOCS.2004.7.
  • [9] Balcan MF, Blum A, Gupta A. Approximate Clustering Without the Approximation. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’09. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2009 pp. 1068-1077. URL http://dl.acm.org/citation.cfm?id=1496770.1496886.
  • [10] Terada Y. Strong consistency of reduced K-means clustering. Scand. J. Stat., 2014. 41(4):913-931. URL https://doi.org/10.1111/sjos.12074.
  • [11] Mahajan M, Nimbhorkar P, Varadarajan K. The Planar k-Means Problem is NP-Hard. In: Proceedings of the 3rd International Workshop on Algorithms and Computation, WALCOM ’09. Springer-Verlag, Berlin, Heidelberg. 2009 pp. 274-285, ISBN: 978-3-642-00201-4.
  • [12] Bachem O, Lucic M, Hassani SH, Krause A. Approximate K-means++ in Sublinear Time. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16. AAAI Press, 2016 pp. 1459-1467.
Notes
Record compiled with funding from MNiSW, agreement No. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: popularisation of science and promotion of sport (2020).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-bdef7c64-4fcb-426a-be57-684f7af2f589