Center-Based Indexing in Vector and Metric Spaces

Wojna, A.

Artykuł - szczegóły

Tytuł artykułu

Center-Based Indexing in Vector and Metric Spaces

Autorzy

Wojna A.

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Identyfikatory

Warianty tytułu

Języki publikacji

Abstrakty

The paper addresses the problem of indexing data for k nearest neighbors (k-nn) search. Given a collection of data objects and a similarity measure the searching goal is to find quickly the k most similar objects to a given query object. We present a top-down indexing method that employs a widely used scheme of indexing algorithms. It starts with the whole set of objects at the root of an indexing tree and iteratively splits data at each level of indexing hierarchy. In the paper two different data models are considered. In the first, objects are represented by vectors from a multi-dimensional vector space. The second, more general, is based on an assumption that objects satisfy only the axioms of a metric space. We propose an iterative k-means algorithm for tree node splitting in case of a vector space and an iterative k-approximate-centers algorithm in case when only a metric space is provided. The experiments show that the iterative k-means splitting procedure accelerates significantly k-nn searching over the one-step procedure used in other indexing structures such as GNAT, SS-tree and M-tree and that the relevant representation of a tree node is an important issue for the performance of the search process. We also combine different search pruning criteria used in BST, GHT nad GNAT structures into one and show that such a combination outperforms significantly each single pruning criterion. The experiments are performed for benchmark data sets of the size up to several hundreds of thousands of objects. The indexing tree with the k-means splitting procedure and the combined search criteria is particularly effective for the largest tested data sets for which this tree accelerates searching up to several thousands times

Słowa kluczowe

similarity k-nearest neighbors data sets indexing metric spaces

Wydawca

IOS Press

Czasopismo

Fundamenta Informaticae

Rocznik

2003

Tom

Vol. 56, nr 3

Strony

285--310

Opis fizyczny

wykr., bibliogr. 40 poz.

Twórcy

autor

Wojna A.

Institute of Informatics Warsaw University ul. Banacha 2 02-097 Warsaw, Poland, woojna@mimuw.edu.pl

Bibliografia

[1] Aggarwal, Ch.C., Hinneburg, A., Keim, D.A. (2001), On the surprising behaviour of distance metrics in high dimensional space. Proceedings of the Eighth Internatinal Conference on Database Theory (ICDT-2001), London, UK, 420-434.
[2] Aha, D.W. (1998). The omnipresence of case-based reasoning in science and application. Knowledge-Based Systems, 11 (5-6), 261-273.
[3] Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B. (1990). The -tree: an efficient and robust access method for points and rectangles. Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, 322-331.
[4] Bentley, J.L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), 509-517 .
[5] Berchtold, S., Keim, D., Kriegel, H.P. (1996). The X-tree: an index structure for high dimensional data. Proceedings of the Twenty Second International Conference on Very Large Databases (VLDB-1996), 28-39.
[6] Beyer, K.S., Goldstein, J., Ramakrishnan, R., Shaft, U. (1999). When Is “Nearest Neighbor” Meaningful? Proceedings of the Seventh International Conference on Database Theory (ICDT-1999), Jerusalem, Israel, 217-235.
[7] Biberman, Y. (1994). A context similarity measure. Proceedings of the Ninth European Conference on Machine Learning (ECML-1994), Catania, Italy, 49-63.
[8] Blake, C.L., Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA.
[9] Brin, S. (1995). Near neighbor search in large metric spaces. Proceedings of the Twenty First International Conference on Very Large Databases (VLDB-1995), 574-584.
[10] Chavez, E., Navarro, G., Baeza-Yates, R., Marroquin, J.L. (1999). Saerching in metric spaces. Technical Report TR/DCC-99-3, Department of Computer Science, University of Chile.
[11] Ciaccia, P., Patella, M., Zezula, P. (1997). M-tree: an efficient access method for similarity search in metric spaces. Proceedings of the Twenty Third International Conference on Very Large Databases (VLDB-1997), 426-435.
[12] Cost, S. and Salzberg, S. (1993). A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10, 57-78.
[13] Cover, T.M. and Hart, P.E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21-27.
[14] Domingos, P. (1996). Unifying instance-based and rule-based induction. Machine Learning, 24(2), 141-168.
[15] Duda, R.O. and Hart, P.E. (1973). Pattern classification and scene analysis. Wiley, New York, NY.
[16] Finkel, R., Bentley, J. (1974). Quad-trees: A data structure for retrieval and composite keys. ACTA Informatica 4(1), 1-9.
[17] Fukunaga, K., Narendra, P.M. (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 24(7), 750-753.
[18] Góra, G., Wojna, A. (2002). RIONA: a classifier combining rule induction and k-nn method with automated selection of optimal neighbourhood. Proceedings of the Thirteenth European Conference on Machine Learning (ECML-2002), Helsinki, Finland, Lecture Notes in Artificial Intelligence, 2430, Springer-Verlag 2002, 111-123.
[19] Góra, G., Wojna, A. (2002). RIONA: a new classification system combining rule induction and instancebased learning. Fundamenta Informaticae, 51(4), 369-390.
[20] Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, Boston, MA, 47-57.
[21] Kalantari, I. and McDonald, G. (1983). A data structure and an algorithm for the nearest point problem. IEEE Transactions on Software Engineering, 9 (5), 631-634.
[22] Katayama, N., Satoh, S. (1997). The SR-tree: an index structure for high dimensional nearest neighbor queries. Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, 369-380.
[23] Lin, K.I., Jagadish, H.V., Faloustos, C. (1994). The TV-tree: an index structure for high dimensional data. VLDB Journal, 3(4), 517-542.
[24] Mitchell T.M. (1997). Machine learning. McGraw-Hill, Portland.
[25] Nievergelt, J., Hinterberger, H., Sevcik, K. (1984). The grid file: an adaptable symmetric multikey file structure. ACM Transactions on Database Systems, 9(1), 38-71.
[26] Robinson, J. (1981). The k-d-b-tree: A search structure for large multi-dimensional dynamic indexes. Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data, ACM, New York, 10-18.
[27] Salzberg, S. (1991). A nearest hyperrectangle learning method. Machine Learning, 2, 229-246.
[28] Savaresi, S.M., Boley D.L. (2001), On the performance of bisecting K-means and PDDP. Proceedings of the First SIAM International Conference on Data Mining (ICDM-2001), Chicago, USA, 1-14.
[29] Sellis, T., Roussopoulos, N., Faloustos, C. (1987). The R+-tree: A dynamic index for multi-dimensional objects. Proceedings of the Thirteenth International Conference on Very Large Databases (VLDB-1987), 574-584.
[30] Skowron, A. et al. (2003). Rough Set Exploration System [http://logic.mimuw.edu.pl/~rses], Institute of Mathematics, Warsaw University, Poland.
[31] Stanfill, C. and Waltz, D. (1986). Toward memory-based reasoning. Communications of the ACM, 29, 1213-1228.
[32] Uhlmann, J. (1991). Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4), 175-179.
[33] Veloso, M. (1994). Planning and learning by analogical reasoning. Springer-Verlag.
[34] Ward, J.Jr (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236-244.
[35] Weber, R., Schek, H.J., Blott, S. (1998). A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proceedings of the Tenty Fourth International Conference on Very Large Databases (VLDB-1998), 194-205.
[36] Wettschereck, D. (1994). A study of Distance-Based Machine Learning Algorithms. Doctor of Philosophy dissertation in Computer Science, Oregon State University.
[37] Wettschereck, D., Aha, D.W., Mohri, T. (1997). A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms. Artificial Intelligence Review 11, 273-314.
[38] White, D.A., Jain R. (1996). Similarity indexing with the SS-tree. Proceedings of the Twelve International Conference on Data Engineering (ICDE-1996), New Orleans, USA, 516-523.
[39] Wojna, A. (2003). Center-based indexing for nearest neighbors search. Proceedings of the Third IEEE International Conference on Data Mining (ICDM-2003), Melbourne, Florida, USA.
[40] Yianilos, P.N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. Proceedings of the Fourth Annual ACM/SIGACT-SIAM Symposium on Discrete Algorithms, Austin, Texas, 311-321.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-article-BUS2-0004-0136