2020 | Vol. 172, no. 3 | 299-325
Article title

Time and Space Efficient Large Scale Link Discovery using String Similarities

Title variants
Publication languages
EN
Abstracts
EN
This paper proposes and evaluates time- and space-efficient methods for matching entities in large data sets. The methods prune the candidate pairs to be matched, using edit distance as the string similarity metric. Three filtering methods are proposed and compared in terms of runtime and memory usage; each builds on a basic blocking technique that organizes the target data set so that dissimilar pairs can be pruned efficiently. The first method clusters entities and exploits the triangle inequality that the string similarity metric satisfies, in conjunction with the substring matching filtering rule. The second method uses only the substring matching rule. The third method combines the substring matching rule with the character frequency matching filtering rule. Evaluation results show the pruning power of each filtering method, also in comparison with the string matching functionality provided by LIMES and SILK, which are state-of-the-art frameworks for large-scale link discovery.
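To make the filter-then-verify idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes an absolute edit distance threshold `tau`, and it interprets the substring matching rule as a pigeonhole check in the style of partition-based similarity joins: if two strings are within `tau` edits, then at least one of `tau + 1` contiguous segments of one string must occur unchanged in the other. The character frequency filter is likewise an assumed formulation based on character counts. All function names are hypothetical, and the blocking and clustering steps are omitted.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def frequency_lower_bound(a: str, b: str) -> int:
    """Lower bound on the edit distance derived from character counts.

    One edit operation removes at most one surplus character on each
    side, so the larger surplus can never exceed the true distance.
    """
    ca, cb = Counter(a), Counter(b)
    surplus_a = sum((ca - cb).values())  # characters a has in excess of b
    surplus_b = sum((cb - ca).values())  # characters b has in excess of a
    return max(surplus_a, surplus_b)


def has_matching_segment(s: str, t: str, tau: int) -> bool:
    """Assumed substring rule: split t into tau + 1 contiguous segments.

    If edit_distance(s, t) <= tau, at least one segment is untouched by
    the edits and therefore occurs in s as a substring.
    """
    parts = tau + 1
    base, extra = divmod(len(t), parts)
    start = 0
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        if t[start:start + size] in s:
            return True
        start += size
    return False


def match(source: str, targets: list[str], tau: int) -> list[str]:
    """Return the targets within tau edits of source, cheapest filters first."""
    links = []
    for t in targets:
        if abs(len(source) - len(t)) > tau:
            continue  # length difference alone already exceeds tau
        if not has_matching_segment(source, t, tau):
            continue  # pruned by the substring rule
        if frequency_lower_bound(source, t) > tau:
            continue  # pruned by the character frequency rule
        if edit_distance(source, t) <= tau:
            links.append(t)  # verified by the exact computation
    return links
```

Each filter is a necessary condition for a match, so pruning never discards a true link; only the pairs that survive every filter pay for the quadratic edit distance computation. For example, `match("piraeus", ["pireaus", "athens", "piraeus university", "pirates"], 2)` verifies only two of the four candidates with the exact distance.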
Publisher

Year
Pages
299-325
Physical description
Bibliography: 23 items, figures, tables, charts.
Authors
  • Digital Systems Department, University of Piraeus, Piraeus, Greece, akar@unipi.gr
  • Digital Systems Department, University of Piraeus, Piraeus, Greece, georgev@unipi.gr
Bibliography
  • [1] Nentwig M, Hartung M, Ngonga Ngomo AC, Rahm E. A survey of current link discovery frameworks. Semantic Web, 2017. 8(3):419-436.
  • [2] Papadakis G, Ioannou E, Palpanas T, Niederee C, Nejdl W. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Trans. on Knowl. and Data Eng., 2013. 25(12):2665-2682. doi:10.1109/TKDE.2012.150. URL https://doi.org/10.1109/TKDE.2012.150.
  • [3] Sun Y, Ma L, Wang S. A comparative evaluation of string similarity metrics for ontology alignment. Journal of Information & Computational Science, 2015. 12(3):957-964.
  • [4] Ngomo ACN. On link discovery using a hybrid approach. Journal on Data Semantics, 2012. 1(4):203-217.
  • [5] Isele R, Jentzsch A, Bizer C. Efficient Multidimensional Blocking for Link Discovery without losing Recall. In: WebDB. 2011.
  • [6] McCrae JP. The Linked Open Data Cloud. URL https://lod-cloud.net/.
  • [7] Winkler WE. The state of record linkage and current research problems. U.S. Bureau of the Census, 1999.
  • [8] Winkler WE. Overview of record linkage and current research directions. Technical report, Bureau of the Census, 2006.
  • [9] Arasu A, Ganti V, Kaushik R. Efficient Exact Set-similarity Joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06. VLDB Endowment, 2006 pp. 918-929. URL http://dl.acm.org/citation.cfm?id=1182635.1164206.
  • [10] Chiappe Laverde A, Segovia Cifuentes Y, Rincón Rodríguez HY. Toward an instructional design model based on learning objects. Educational Technology Research and Development, 2007. 55(6):671-681. doi:10.1007/s11423-007-9059-0. URL https://doi.org/10.1007/s11423-007-9059-0.
  • [11] Xiao C, Wang W, Lin X, Yu JX. Efficient Similarity Joins for Near Duplicate Detection. In: Proceedings of the 17th International Conference on World Wide Web, WWW ’08. ACM, New York, NY, USA. ISBN 978-1-60558-085-2, 2008 pp. 131-140. doi:10.1145/1367497.1367516. URL http://doi.acm.org/10.1145/1367497.1367516.
  • [12] Jiang Y, Li G, Feng J, Li WS. String Similarity Joins: An Experimental Evaluation. Proc. VLDB Endow., 2014. 7(8):625-636. doi:10.14778/2732296.2732299. URL http://dx.doi.org/10.14778/2732296.2732299.
  • [13] Qin J, Wang W, Lu Y, Xiao C, Lin X. Efficient Exact Edit Similarity Query Processing with the Asymmetric Signature Scheme. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11. ACM, New York, NY, USA. ISBN 978-1-4503-0661-4, 2011 pp. 1033-1044. doi:10.1145/1989323.1989431. URL http://doi.acm.org/10.1145/1989323.1989431.
  • [14] Li G, Deng D, Wang J, Feng J. Pass-join: A Partition-based Method for Similarity Joins. Proc. VLDB Endow., 2011. 5(3):253-264. doi:10.14778/2078331.2078340. URL http://dx.doi.org/10.14778/2078331.2078340.
  • [15] Bocek T, Hunt E, Stiller B, Hecht F. Fast similarity search in large dictionaries. University, 2007.
  • [16] Bayardo RJ, Ma Y, Srikant R. Scaling Up All Pairs Similarity Search. In: Proceedings of the 16th International Conference on World Wide Web, WWW ’07. ACM, New York, NY, USA. ISBN 978-1-59593-654-7, 2007 pp. 131-140. doi:10.1145/1242572.1242591. URL http://doi.acm.org/10.1145/1242572.1242591.
  • [17] Feng J, Wang J, Li G. Trie-join: A Trie-based Method for Efficient String Similarity Joins. The VLDB Journal, 2012. 21(4):437-461. doi:10.1007/s00778-011-0252-8. URL http://dx.doi.org/10.1007/s00778-011-0252-8.
  • [18] Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 1966. 10(8):707-710.
  • [19] Ngomo ACN, Auer S. LIMES - a time-efficient approach for large-scale link discovery on the Web of Data. In: IJCAI. 2011 pp. 2312-2317.
  • [20] Xiao C, Wang W, Lin X. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. Proc. VLDB Endow., 2008. 1(1):933-944. doi:10.14778/1453856.1453957. URL http://dx.doi.org/10.14778/1453856.1453957.
  • [21] Wagner RA, Fischer MJ. The String-to-String Correction Problem. J. ACM, 1974. 21(1):168-173. doi:10.1145/321796.321811. URL http://doi.acm.org/10.1145/321796.321811.
  • [22] Ngomo ACN, Kolb L, Heino N, Hartung M, Auer S, Rahm E. When to Reach for the Cloud: Using Parallel Hardware for Link Discovery. In: ESWC. 2013 pp. 275-289.
  • [23] Kolb L, Thor A, Rahm E. Dedoop: Efficient Deduplication with Hadoop. Proc. VLDB Endow., 2012. 5(12):1878-1881. doi:10.14778/2367502.2367527. URL http://dx.doi.org/10.14778/2367502.2367527.
Notes
Record prepared with funds from MNiSW, agreement No. 461252, under the programme
"Social Responsibility of Science" - module: Popularisation of science and promotion of
sport (2020).
Document type
Bibliography
Identifiers
YADDA identifier
bwmeta1.element.baztech-cf588548-4f60-4afb-8a07-f1738ddc43e6