Time and Space Efficient Large Scale Link Discovery using String Similarities

Karampelas, Andreas; Vouros, George A.

doi:10.3233/FI-2020-1906

Article - details

Journal

Fundamenta Informaticae

2020 | Vol. 172, no. 3 | 299-325

Article title

Time and Space Efficient Large Scale Link Discovery using String Similarities

Authors

Karampelas, Andreas; Vouros, George A.

Selected full texts from this journal

https://fi.episciences.org/

Abstracts

This paper proposes and evaluates time- and space-efficient methods for matching entities in large data sets. The methods prune the candidate pairs to be matched, using edit distance as the string similarity metric. Three filtering methods are proposed and compared in terms of runtime and memory usage; all build on a basic blocking technique that organizes the target data set so that dissimilar pairs can be pruned efficiently. The first method clusters entities and exploits the triangle inequality on the string similarity metric, in conjunction with the substring matching filtering rule. The second method uses only the substring matching rule. The third method uses the substring matching rule in conjunction with the character frequency matching filtering rule. Evaluation results show the pruning power of each filtering method, also in comparison to the string matching functionality provided in LIMES and SILK, two state-of-the-art frameworks for large-scale link discovery.
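The abstract names the filtering rules without defining them, so the following is only a minimal sketch of how such filters are commonly realized, not the authors' implementation. It assumes the character frequency rule is the usual bag-of-characters lower bound on edit distance, that the substring rule is the pigeonhole argument used by partition-based joins such as Pass-Join [14], and that the triangle inequality is applied through a cluster exemplar. All function names are hypothetical.

```python
from collections import Counter


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the Wagner-Fischer dynamic program (two rows)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def frequency_lower_bound(s: str, t: str) -> int:
    """Assumed frequency rule: each edit removes at most one surplus
    character on each side, so the larger surplus bounds the distance."""
    cs, ct = Counter(s), Counter(t)
    surplus_s = sum((cs - ct).values())  # characters s has in excess of t
    surplus_t = sum((ct - cs).values())  # characters t has in excess of s
    return max(surplus_s, surplus_t)


def passes_substring_filter(s: str, t: str, threshold: int) -> bool:
    """Assumed substring rule (pigeonhole): split s into threshold + 1
    segments; if the distance is within threshold, at least one segment
    is untouched by the edits and therefore occurs in t."""
    parts = threshold + 1
    base, extra = divmod(len(s), parts)
    start = 0
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        if s[start:start + size] in t:
            return True
        start += size
    return False


def triangle_lower_bound(d_s_exemplar: int, d_exemplar_t: int) -> int:
    """Triangle inequality: d(s, t) >= |d(s, e) - d(e, t)| for any exemplar e."""
    return abs(d_s_exemplar - d_exemplar_t)


def is_match(s: str, t: str, threshold: int) -> bool:
    """Run the cheap filters first; compute the full distance only if needed."""
    if frequency_lower_bound(s, t) > threshold:
        return False
    if not passes_substring_filter(s, t, threshold):
        return False
    return edit_distance(s, t) <= threshold
```

Both filters are safe in the sense that they never discard a pair whose edit distance is within the threshold; they only avoid the quadratic-time distance computation for pairs that cannot match. For instance, `is_match("kitten", "sitting", 2)` is rejected by the frequency bound alone, without running the dynamic program.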

Keywords

edit distance, filtering rule, frequency matching, link discovery, string matching

Journal

Fundamenta Informaticae

Year

2020

Volume

Vol. 172, no. 3

Pages

299-325

Physical description

Bibliography: 23 items, figures, tables, charts.

Contributors

author

Karampelas, Andreas

Digital Systems Department, University of Piraeus, Piraeus, Greece, akar@unipi.gr

author

Vouros, George A.

Digital Systems Department, University of Piraeus, Piraeus, Greece, georgev@unipi.gr

Bibliography

[1] Nentwig M, Hartung M, Ngonga Ngomo AC, Rahm E. A survey of current link discovery frameworks. Semantic Web, 2017. 8(3):419-436.
[2] Papadakis G, Ioannou E, Palpanas T, Niederee C, Nejdl W. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Trans. on Knowl. and Data Eng., 2013. 25(12):2665-2682. doi:10.1109/TKDE.2012.150. URL https://doi.org/10.1109/TKDE.2012.150.
[3] Sun Y, Ma L, Wang S. A comparative evaluation of string similarity metrics for ontology alignment. Journal of Information & Computational Science, 2015. 12(3):957-964.
[4] Ngomo ACN. On link discovery using a hybrid approach. Journal on Data Semantics, 2012. 1(4):203-217.
[5] Isele R, Jentzsch A, Bizer C. Efficient Multidimensional Blocking for Link Discovery without losing Recall. In: WebDB. 2011.
[6] McCrae JP. The Linked Open Data Cloud. URL https://lod-cloud.net/.
[7] Winkler WE. The state of record linkage and current research problems. U.S. Bureau of the Census, 1999.
[8] Winkler WE. Overview of record linkage and current research directions. Technical report, Bureau of the Census, 2006.
[9] Arasu A, Ganti V, Kaushik R. Efficient Exact Set-similarity Joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06. VLDB Endowment, 2006 pp. 918-929. URL http://dl.acm.org/citation.cfm?id=1182635.1164206.
[10] Chiappe Laverde A, Segovia Cifuentes Y, Rincón Rodríguez HY. Toward an instructional design model based on learning objects. Educational Technology Research and Development, 2007. 55(6):671-681. doi:10.1007/s11423-007-9059-0. URL https://doi.org/10.1007/s11423-007-9059-0.
[11] Xiao C, Wang W, Lin X, Yu JX. Efficient Similarity Joins for Near Duplicate Detection. In: Proceedings of the 17th International Conference on World Wide Web, WWW ’08. ACM, New York, NY, USA. ISBN 978-1-60558-085-2, 2008 pp. 131-140. doi:10.1145/1367497.1367516. URL http://doi.acm.org/10.1145/1367497.1367516.
[12] Jiang Y, Li G, Feng J, Li WS. String Similarity Joins: An Experimental Evaluation. Proc. VLDB Endow., 2014. 7(8):625-636. doi:10.14778/2732296.2732299. URL http://dx.doi.org/10.14778/2732296.2732299.
[13] Qin J, Wang W, Lu Y, Xiao C, Lin X. Efficient Exact Edit Similarity Query Processing with the Asymmetric Signature Scheme. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11. ACM, New York, NY, USA. ISBN 978-1-4503-0661-4, 2011 pp. 1033-1044. doi:10.1145/1989323.1989431. URL http://doi.acm.org/10.1145/1989323.1989431.
[14] Li G, Deng D, Wang J, Feng J. Pass-join: A Partition-based Method for Similarity Joins. Proc. VLDB Endow., 2011. 5(3):253-264. doi:10.14778/2078331.2078340. URL http://dx.doi.org/10.14778/2078331.2078340.
[15] Bocek T, Hunt E, Stiller B, Hecht F. Fast similarity search in large dictionaries. University, 2007.
[16] Bayardo RJ, Ma Y, Srikant R. Scaling Up All Pairs Similarity Search. In: Proceedings of the 16th International Conference on World Wide Web, WWW ’07. ACM, New York, NY, USA. ISBN 978-1-59593-654-7, 2007 pp. 131-140. doi:10.1145/1242572.1242591. URL http://doi.acm.org/10.1145/1242572.1242591.
[17] Feng J, Wang J, Li G. Trie-join: A Trie-based Method for Efficient String Similarity Joins. The VLDB Journal, 2012. 21(4):437-461. doi:10.1007/s00778-011-0252-8. URL http://dx.doi.org/10.1007/s00778-011-0252-8.
[18] Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 1966. 10(8):707-710.
[19] Ngomo ACN, Auer S. LIMES - A time-efficient approach for large-scale link discovery on the web of data. In: IJCAI. 2011 pp. 2312-2317.
[20] Xiao C, Wang W, Lin X. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. Proc. VLDB Endow., 2008. 1(1):933-944. doi:10.14778/1453856.1453957. URL http://dx.doi.org/10.14778/1453856.1453957.
[21] Wagner RA, Fischer MJ. The String-to-String Correction Problem. J. ACM, 1974. 21(1):168-173. doi:10.1145/321796.321811. URL http://doi.acm.org/10.1145/321796.321811.
[22] Ngomo ACN, Kolb L, Heino N, Hartung M, Auer S, Rahm E. When to Reach for the Cloud: Using Parallel Hardware for Link Discovery. In: ESWC. 2013 pp. 275-289.
[23] Kolb L, Thor A, Rahm E. Dedoop: Efficient Deduplication with Hadoop. Proc. VLDB Endow., 2012. 5(12):1878-1881. doi:10.14778/2367502.2367527. URL http://dx.doi.org/10.14778/2367502.2367527.

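Notes

Record prepared with funds from MNiSW (the Polish Ministry of Science and Higher Education), agreement no. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) - module: Popularisation of science and promotion of sport (2020).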

Document type

Bibliography

Identifiers

DOI

10.3233/FI-2020-1906

YADDA identifier

bwmeta1.element.baztech-cf588548-4f60-4afb-8a07-f1738ddc43e6