Article title

Using Non-Zero Dimensions for the Cosine and Tanimoto Similarity Search Among Real Valued Vectors

Publication languages
EN
Abstracts
EN
The cosine and Tanimoto similarity measures are typically applied in the area of chemical informatics, bio-informatics, information retrieval, text and web mining as well as in very large databases for searching sufficiently similar vectors. In the case of large sparse high dimensional data sets such as text or Web data sets, one typically applies inverted indices for identification of candidates for sufficiently similar vectors to a given vector. In this article, we offer new theoretical results on how the knowledge about non-zero dimensions of real valued vectors can be used to reduce the number of candidates for vectors sufficiently cosine and Tanimoto similar to a given one. We illustrate and discuss the usefulness of our findings on a sample collection of documents represented by a set of a few thousand real valued vectors with more than ten thousand dimensions.
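For orientation, the baseline that the abstract refers to (candidate generation with an inverted index over non-zero dimensions, followed by exact verification) can be sketched as below. This is a minimal illustration and not the article's method: it uses the standard definitions of cosine and Tanimoto similarity for real valued vectors and does not implement the article's new candidate-reduction results. The function names, the sparse `{dimension: value}` representation, and the threshold name `eps` are assumptions made for this sketch; vectors are assumed to be non-zero and `eps` is assumed to be positive.

```python
from collections import defaultdict
from math import sqrt


def dot(u, v):
    """Dot product of two sparse vectors stored as {dimension: non-zero value}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(x * v[d] for d, x in u.items() if d in v)


def cosine(u, v):
    """Cosine similarity: u.v / (|u| * |v|). Assumes both vectors are non-zero."""
    return dot(u, v) / sqrt(dot(u, u) * dot(v, v))


def tanimoto(u, v):
    """Tanimoto similarity: u.v / (|u|^2 + |v|^2 - u.v). Assumes non-zero vectors."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)


def build_inverted_index(vectors):
    """Map each dimension to the ids of the vectors that are non-zero on it."""
    index = defaultdict(set)
    for vid, vec in vectors.items():
        for d in vec:
            index[d].add(vid)
    return index


def similar(query, vectors, index, sim, eps):
    """Return ids of vectors whose similarity to `query` is at least eps (eps > 0).

    A vector sharing no non-zero dimension with the query has dot product 0,
    hence similarity 0 < eps, so only vectors found through the inverted
    index need to be verified.
    """
    candidates = set()
    for d in query:
        candidates |= index.get(d, set())
    return {vid for vid in candidates if sim(query, vectors[vid]) >= eps}
```

In this sketch every vector that shares at least one non-zero dimension with the query becomes a candidate, which is the candidate set that the article's results aim to shrink.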
Pages
307-323
Physical description
Bibliography: 21 items, tables.
Contributors
  • Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland
Bibliography
  • [1] Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. Proc. VLDB 2006, ACM (2006)
  • [2] Bayardo, R. J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. Proc. WWW 2007, ACM (2007) 131-140
  • [3] Chaudhuri, S., Ganti, V., Kaushik, R. L.: A primitive operator for similarity joins in data cleaning. Proc. ICDE 2006, IEEE Computer Society (2006)
  • [4] Elkan, C.: Using the Triangle Inequality to Accelerate k-Means. Proc. ICML 2003, Washington (2003) 147-153
  • [5] Kristensen, T. G.: Transforming Tanimoto queries on real valued vectors to range queries in Euclidian space. J. Mathematical Chemistry 48 (2) (2010) 287-289
  • [6] Kryszkiewicz, M.: Efficient Determination of Binary Non-Negative Vector Neighbors with Regard to Cosine Similarity. Proc. IEA/AIE 2012, LNCS (LNAI) 7345, Springer (2012) 48-57
  • [7] Kryszkiewicz, M.: The Triangle Inequality versus Projection onto a Dimension in Determining Cosine Similarity Neighborhoods of Non-negative Vectors. Proc. RSCTC 2012, LNCS (LNAI) 7413, Springer (2012) 229-236
  • [8] Kryszkiewicz, M.: Bounds on Lengths of Vectors Similar with Regard to the Tanimoto and Cosine Similarity. ICS Research Report 3, Institute of Computer Science, Warsaw University of Technology, Warsaw (2012)
  • [9] Kryszkiewicz, M.: Determining Cosine Similarity Neighborhoods by Means of the Euclidean Distance. Rough Sets and Intelligent Systems, Intelligent Systems Reference Library 43, Springer (2013) 323-345
  • [10] Kryszkiewicz, M.: Bounds on Lengths of Real Valued Vectors Similar with Regard to the Tanimoto Similarity. Proc. ACIIDS (1) 2013, LNCS (LNAI) 7802, Springer (2013) 445-454
  • [11] Kryszkiewicz, M.: On Cosine and Tanimoto Near Duplicates Search among Vectors with Domains Consisting of Zero, a Positive Number and a Negative Number. Proc. FQAS 2013, LNCS (LNAI) 8132, Springer (2013) 531-542
  • [12] Kryszkiewicz, M., Lasek, P.: TI-DBSCAN: Clustering with DBSCAN by Means of the Triangle Inequality. Proc. RSCTC 2010, LNCS (LNAI) 6086, Springer (2010) 60-69
  • [13] Kryszkiewicz, M., Lasek, P.: A Neighborhood-Based Clustering by Means of the Triangle Inequality. Proc. IDEAL 2010, LNCS 6283, Springer (2010) 284-291
  • [14] Moore, A. W.: The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data. Proc. UAI, Stanford (2000) 397-405
  • [15] Salton, G., Wong, A., Yang, C. S.: A vector space model for automatic indexing. Communications of the ACM, 18(11) (1975) 613-620
  • [16] Samet, H.: Foundations of multidimensional and metric data structures. Morgan Kaufmann (2006)
  • [17] Uhlmann, J.: Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40 (4) (1991) 175-179
  • [18] Willett, P., Barnard, J. M., Downs, G. M.: Chemical similarity searching. J. Chem. Inf. Comput. Sci., 38 (6) (1998) 983-996
  • [19] Witten, I. H., Moffat, A., Bell, T. C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann (1999)
  • [20] Xiao, C., Wang, W., Lin, X., Yu, J. X.: Efficient similarity joins for near duplicate detection. Proc. WWW (2008) 131-140
  • [21] Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach. Springer (2006)
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-28464033-ca63-4355-9be1-5a471aec6be6