Using Non-Zero Dimensions for the Cosine and Tanimoto Similarity Search Among Real Valued Vectors

Kryszkiewicz, M.

Article - details

Selected full texts from this journal

https://fi.episciences.org/

Abstract

The cosine and Tanimoto similarity measures are typically applied in chemical informatics, bioinformatics, information retrieval, and text and Web mining, as well as in very large databases, to search for sufficiently similar vectors. For large, sparse, high-dimensional data sets such as text or Web data sets, one typically applies inverted indices to identify candidates for vectors sufficiently similar to a given vector. In this article, we offer new theoretical results on how knowledge about the non-zero dimensions of real valued vectors can be used to reduce the number of candidates for vectors sufficiently similar to a given one with respect to the cosine and Tanimoto measures. We illustrate and discuss the usefulness of our findings on a sample collection of documents represented by a set of a few thousand real valued vectors with more than ten thousand dimensions.
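
For orientation, the baseline setting the abstract refers to can be sketched as follows. This is a minimal illustration, not the article's method: it uses the standard definitions of the two measures, cosine(u, v) = u·v / (|u| |v|) and Tanimoto(u, v) = u·v / (|u|² + |v|² - u·v), together with a plain inverted index over non-zero dimensions. The function names, the dictionary-based sparse representation, and the toy vectors are assumptions made for this example; the article's new candidate-reduction bounds are not reproduced here.

```python
import math
from collections import defaultdict

# Assumption: a sparse vector is a dict {dimension: value} holding only
# its non-zero entries, and every vector has at least one such entry.

def dot(u, v):
    """Dot product over the dimensions that are non-zero in both vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(x * v[d] for d, x in u.items() if d in v)

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def tanimoto(u, v):
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)

def build_inverted_index(vectors):
    """Map each dimension to the ids of vectors that are non-zero on it."""
    index = defaultdict(set)
    for vid, vec in vectors.items():
        for d in vec:
            index[d].add(vid)
    return index

def candidates(index, query):
    """Ids of vectors sharing at least one non-zero dimension with the query.

    For a threshold eps > 0 no other vector can qualify: without a shared
    non-zero dimension the dot product, and hence both similarities, is 0.
    """
    found = set()
    for d in query:
        found |= index.get(d, set())
    return found

def similar(vectors, index, query, eps, sim=cosine):
    """Verify each candidate against the threshold eps (assumed > 0)."""
    return sorted(vid for vid in candidates(index, query)
                  if sim(vectors[vid], query) >= eps)

# Toy data (hypothetical, for illustration only).
vectors = {
    "a": {0: 1.0, 3: 2.0},
    "b": {0: 1.0, 3: 2.0, 7: 0.5},
    "c": {5: 4.0},
    "d": {3: -1.0, 5: 1.0},
}
index = build_inverted_index(vectors)
query = {0: 1.0, 3: 2.0}
print(similar(vectors, index, query, eps=0.9))
print(similar(vectors, index, query, eps=0.96, sim=tanimoto))
```

In this sketch the index only excludes vectors that share no non-zero dimension with the query (here, "c"); every remaining candidate is still verified by a full similarity computation, which is the cost that tighter candidate bounds aim to cut.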

Keywords

data mining; high dimensional data sets; information retrieval; inverted indices; similarity joins; sparse data sets; text mining; cosine similarity; Tanimoto similarity

Publisher

IOS Press

Journal

Fundamenta Informaticae

Year

2013

Volume

Vol. 127, no. 1-4

Pages

307-323

Physical description

Bibliography: 21 items, tables.

Authors

author

Kryszkiewicz M.

mkr@ii.pw.edu.pl

Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland

References

[1] Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. Proc. VLDB 2006, ACM (2006)
[2] Bayardo, R. J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. Proc. WWW 2007, ACM (2007) 131-140
[3] Chaudhuri, S., Ganti, V., Kaushik, R. L.: A primitive operator for similarity joins in data cleaning. Proc. ICDE06, IEEE Computer Society (2006)
[4] Elkan, C.: Using the Triangle Inequality to Accelerate k-Means. Proc. ICML03, Washington (2003) 147-153
[5] Kristensen, T. G.: Transforming Tanimoto queries on real valued vectors to range queries in Euclidian space. J. Mathematical Chemistry 48 (2) (2010) 287-289
[6] Kryszkiewicz, M.: Efficient Determination of Binary Non-Negative Vector Neighbors with Regard to Cosine Similarity. Proc. IEA/AIE 2012, LNCS (LNAI) 7345, Springer (2012) 48-57
[7] Kryszkiewicz, M.: The Triangle Inequality versus Projection onto a Dimension in Determining Cosine Similarity Neighborhoods of Non-negative Vectors. Proc. RSCTC 2012, LNCS (LNAI) 7413, Springer (2012) 229-236
[8] Kryszkiewicz, M.: Bounds on Lengths of Vectors Similar with Regard to the Tanimoto and Cosine Similarity. ICS Research Report 3, Institute of Computer Science, Warsaw University of Technology, Warsaw (2012)
[9] Kryszkiewicz, M.: Determining Cosine Similarity Neighborhoods by Means of the Euclidean Distance. Rough Sets and Intelligent Systems, Intelligent Systems Reference Library 43, Springer (2013) 323-345
[10] Kryszkiewicz, M.: Bounds on Lengths of Real Valued Vectors Similar with Regard to the Tanimoto Similarity. Proc. ACIIDS (1) 2013, LNCS (LNAI) 7802, Springer (2013) 445-454
[11] Kryszkiewicz, M.: On Cosine and Tanimoto Near Duplicates Search among Vectors with Domains Consisting of Zero, a Positive Number and a Negative Number. Proc. FQAS 2013, LNCS (LNAI) 8132, Springer (2013) 531-542
[12] Kryszkiewicz, M., Lasek, P.: TI-DBSCAN: Clustering with DBSCAN by Means of the Triangle Inequality. Proc. RSCTC 2010, LNCS (LNAI) 6086, Springer (2010) 60-69
[13] Kryszkiewicz, M., Lasek, P.: A Neighborhood-Based Clustering by Means of the Triangle Inequality. Proc. IDEAL 2010, LNCS 6283, Springer (2010) 284-291
[14] Moore, A. W.: The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data. Proc. UAI, Stanford (2000) 397-405
[15] Salton, G., Wong, A., Yang, C. S.: A vector space model for automatic indexing. Communications of the ACM, 18(11) (1975) 613-620
[16] Samet, H.: Foundations of multidimensional and metric data structures. Morgan Kaufmann (2006)
[17] Uhlmann, J.: Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40 (4) (1991) 175-179
[18] Willett, P., Barnard, J. M., Downs, G. M.: Chemical similarity searching. J. Chem. Inf. Comput. Sci., 38 (6) (1998) 983-996
[19] Witten, I. H., Moffat, A., Bell, T. C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann (1999)
[20] Xiao, C., Wang, W., Lin, X., Yu, J. X.: Efficient similarity joins for near duplicate detection. Proc. WWW (2008) 131-140
[21] Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach. Springer (2006)

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-28464033-ca63-4355-9be1-5a471aec6be6