Data integration for earthquake disaster using real-world data

Tian, Chuanzhao; Li, Guoqing

doi:10.1007/s11600-019-00381-4

Article - details

Article title

Data integration for earthquake disaster using real-world data

Authors

Tian Chuanzhao , Li Guoqing

Selected full texts from this journal

https://www.springer.com/journal/11600

Identifiers

DOI

10.1007/s11600-019-00381-4

Title variants

Publication languages

Abstracts

The purpose of entity resolution (ER) is to identify records from different sources that refer to the same real-world entity. Most traditional ER studies work with string-based data, so they rely mostly on string comparison techniques; there is little research on numeric data. Traditional ER approaches are widely used in many domains, such as papers, gene sequencing and restaurants, but they have not been applied to earthquake disasters. In this paper, earthquake disaster event information collected from different websites is represented as numeric data. To address ER on numeric data, we experiment with three methods. First, we treat numbers as strings and use string-based approaches. Second, we use the Euclidean distance to measure the difference between two records. Third, we combine the two strategies into a comprehensive approach for measuring the distance between two records. We experimentally evaluate our methods on real datasets that represent earthquake disaster event information. The experimental results show that the comprehensive approach can achieve high performance.
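The three strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not say which string measure, which normalisation, or which combination rule the authors used, so everything below is an assumption for illustration: Levenshtein edit distance stands in for the string-based approach, `1 / (1 + d)` maps the Euclidean distance to a similarity, and the combined score is a plain weighted average. The record layout (latitude, longitude, magnitude) and the `weight` parameter are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(x, y) -> float:
    """Strategy 1 (assumed form): render the numbers as text and compare."""
    sx = " ".join(str(v) for v in x)
    sy = " ".join(str(v) for v in y)
    longest = max(len(sx), len(sy))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(sx, sy) / longest


def euclidean_similarity(x, y) -> float:
    """Strategy 2 (assumed form): Euclidean distance mapped into (0, 1]."""
    d = math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))
    return 1.0 / (1.0 + d)


def combined_similarity(x, y, weight: float = 0.5) -> float:
    """Strategy 3 (assumed form): weighted average of the two scores."""
    return (weight * string_similarity(x, y)
            + (1.0 - weight) * euclidean_similarity(x, y))


# Hypothetical records: (latitude, longitude, magnitude).
site_a = (35.70, 139.69, 6.1)
site_b = (35.71, 139.70, 6.1)   # probably the same event as site_a
site_c = (10.00, 20.00, 4.5)    # clearly a different event

print(combined_similarity(site_a, site_b))
print(combined_similarity(site_a, site_c))
```

In practice the attributes would likely need scaling before the Euclidean step, since coordinates and magnitudes live on different ranges, and a match threshold on the combined score would have to be tuned on labelled data.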

Keywords

data integration, earthquake, numerical data, entity resolution

Polish keywords: integracja danych (data integration), trzęsienie ziemi (earthquake), dane numeryczne (numerical data)

Publisher

Instytut Geofizyki PAN
Springer

Journal

Acta Geophysica

Year

2020

Volume

Vol. 68, no. 1

Pages

19–28

Physical description

Bibliography: 40 items.

Contributors

author

Tian Chuanzhao

Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100094, China
University of Chinese Academy of Sciences, Beijing 100049, China

author

Li Guoqing

ligq@radi.ac.cn

Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100094, China

Bibliography

1. Ayat N, Afsarmanesh H, Akbarinia R, Valduriez P (2012) An uncertain data integration system. In: On the Move to Meaningful Internet Systems: OTM
2. Ayat N, Akbarinia R, Afsarmanesh H, Valduriez P (2014) Entity resolution for probabilistic data. Inf Sci 277:492–511
3. Baeza-Yates R, Gonnet GH (1992) A new approach to text searching. Commun ACM 35(10):74–82
4. Boyer RS, Moore JS (1977) A fast string searching algorithm. Commun ACM 20(10):762–772
5. Chang WI, Lampe J (1992) Theoretical and empirical comparisons of approximate string matching algorithms. In: Combinatorial pattern matching, third annual symposium, CPM 92, Tucson, Arizona, USA, April 29–May 1, 1992, Proceedings. Springer
6. Christen P (2011) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555
7. Christen P, Goiser K (2007) Quality and complexity measures for data linkage and deduplication. Complexity 43:127–151
8. Clark DE (2004) Practical introduction to record linkage for injury research. Injury Prev 10(3):186–191
9. Du MW, Chang SC (1994) An approach to designing very fast approximate string matching algorithms. IEEE Trans Knowl Data Eng 6(4):620–633
10. Elmagarmid AK, Ipeirotis PG, Verykios VS (2006) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16
11. Fan X (2016) GEOFON data center. Recent Dev World Seismol 452(8):33–41
12. Fellegi IP, Sunter AB (1969) A theory for record linkage. J Am Stat Assoc 64(328):1183–1210
13. Galil Z, Giancarlo R (1988) Data structures and algorithms for approximate string matching. J Complex 4(1):33–72
14. Geller RJ (2007) Earthquake prediction: a critical review. Geophys J Int 131(3):425–450
15. Gomaa WH, Fahmy AA (2013) A survey of text similarity approaches. Int J Comput Appl 68(13):13–18
16. Hassanzadeh O, Chiang F, Lee HC, Miller RJ (2009) Framework for evaluating clustering algorithms in duplicate detection. Proc VLDB Endow 2(1):1282–1293
17. Jaro MA (1980) UNIMATCH, a record linkage system: users manual. Bureau of the Census
18. Kelman CW, Bass AJ, Holman CDJ (2010) Research use of linked health data — a best practice protocol. Aust N Z J Publ Health 26(2):251–255
19. Khan B, Rauf A, Shah SH, Khusro S (2011) Identification and removal of duplicated records. World Appl Sci J 13(5):1178–1184
20. Knuth DE, Morris JH Jr, Pratt VR (1977) Fast pattern matching in strings. SIAM J Comput 6(2):323–350
21. Köpcke H, Thor A, Rahm E (2010) Evaluation of entity resolution approaches on real-world match problems. Proc VLDB Endow 3(1–2):484–493
22. Koudas N, Marathe A, Srivastava D (2004) Flexible string matching against large databases in practice. In: Thirtieth international conference on very large data bases
23. Lee S, Lee J, Hwang SW (2014) Efficient entity matching using materialized lists. Inf Sci 261:170–184
24. Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady, vol 10, No 8, pp 707–710
25. Li L, Li J, Gao H (2015) Rule-based method for entity resolution. IEEE Trans Knowl Data Eng 27(1):250–263
26. Magnani M, Montesi D (2010) A survey on uncertainty management in data integration. J Data Inf Qual 2(1):1–33
27. Miller FP, Vandome AF, Mcbrewster J (1980) Approximate string matching. ACM Comput Surv 12(4):381–402
28. Monge AE (2000) Matching algorithms within a duplicate detection system. IEEE Data Eng Bull 23(4):14–20
29. Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
30. Peterson JL (1980) Computer programs for detecting and correcting spelling errors. Commun ACM 23(12):676–687
31. Pinheiro JC, Sun DX (1998) Methods for linking and mining massive heterogeneous databases. In: Proceedings of the fourth international conference on knowledge discovery and data mining, August. AAAI Press, pp 309–313
32. Ristad ES, Yianilos PN (1998) Learning string-edit distance. IEEE Trans Pattern Anal Mach Intell 20(5):522–532
33. Steorts RC, Ventura SL, Sadinle M, Fienberg SE (2014) A comparison of blocking methods for record linkage. In: International conference on privacy in statistical databases. Springer, Cham
34. Sun CC, Shen DR, Kou Y, Nie TZ, Yu G (2016) Entity resolution oriented clustering algorithm. J Softw 27(9):2303–2319 (in Chinese)
35. Sutinen E, Tarhio J (1995) On using q-gram locations in approximate string matching. In: Algorithms-esa 95, third European symposium, Corfu, Greece, September. DBLP
36. Ukkonen E (1992) Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 92(1):191–211
37. Waterman MS, Smith TF, Beyer WA (1976) Some biological sequence metrics. Adv Math 20(3):367–387
38. Winkler WE (2004) Methods for evaluating and creating data quality. Inf Syst 29(7):531–550
39. Winkler WE (2006) Overview of record linkage and current research directions. In: Bureau of the Census
40. Zhu B, Suo M, Chen Y, Zhang Z, Li S (2018) Mixed H∞ and passivity control for a class of stochastic nonlinear sampled-data systems. J Frankl Inst 355(7):3310–3329

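Notes

Record prepared with funds from MNiSW (the Polish Ministry of Science and Higher Education), agreement No. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) - module: Popularisation of science and promotion of sport (2021)

Document type

Bibliography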

YADDA identifier

bwmeta1.element.baztech-377c141f-938f-48c2-95c8-9742b47e00c9