Article title

Data integration for earthquake disaster using real-world data

Entity resolution (ER) aims to identify records from different sources that refer to the same real-world entity. Most traditional ER studies work on string-based data, so ER relies mostly on string comparison techniques, and there is little research on numeric data. Traditional ER approaches are widely used in domains such as papers, gene sequencing, and restaurants, but they have not been applied to earthquake disasters. In this paper, earthquake disaster event information collected from different websites is represented as numeric data. To address ER on numeric data, we experiment with three methods. First, we treat numbers as strings and apply string-based approaches. Second, we use the Euclidean distance to measure the difference between two records. Third, we combine the two strategies into a comprehensive approach for measuring the distance between two records. We evaluate these methods experimentally on real datasets of earthquake disaster event information. The results show that the comprehensive approach can achieve high performance.
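The three strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the string measure, the normalization, or how the two scores are combined. The sketch assumes Levenshtein edit distance for the string-based strategy, a per-field scaled Euclidean distance mapped to a similarity as 1 / (1 + d), and a weighted sum with a hypothetical weight `alpha`. The field layout (latitude, longitude, magnitude, depth) and the sample records are made up for illustration.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def string_similarity(r1, r2) -> float:
    """Strategy 1 (assumed form): treat each number as a string and
    average the normalized edit similarity over all fields."""
    sims = []
    for x, y in zip(r1, r2):
        sx, sy = str(x), str(y)
        longest = max(len(sx), len(sy))
        sims.append(1.0 - levenshtein(sx, sy) / longest if longest else 1.0)
    return sum(sims) / len(sims)


def euclidean_similarity(r1, r2, scales) -> float:
    """Strategy 2 (assumed form): Euclidean distance over fields divided by
    hypothetical per-field scales, mapped into (0, 1]."""
    d = math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(r1, r2, scales)))
    return 1.0 / (1.0 + d)


def combined_similarity(r1, r2, scales, alpha: float = 0.5) -> float:
    """Strategy 3 (assumed form): weighted sum of the two similarities;
    alpha is an illustrative weight, not a value from the paper."""
    return (alpha * string_similarity(r1, r2)
            + (1.0 - alpha) * euclidean_similarity(r1, r2, scales))


# Made-up event records: (latitude, longitude, magnitude, depth in km).
event_a = (35.70, 139.50, 6.1, 10.0)
event_b = (35.71, 139.52, 6.0, 10.0)   # plausibly the same event, other source
event_c = (-33.45, -70.66, 7.8, 35.0)  # clearly a different event
scales = (1.0, 1.0, 1.0, 10.0)         # hypothetical per-field scales

print(combined_similarity(event_a, event_b, scales))
print(combined_similarity(event_a, event_c, scales))
```

Under these assumptions, a pair would be declared a match when its combined score exceeds a threshold chosen on labeled data. The scales matter because raw Euclidean distance would otherwise be dominated by whichever field has the largest numeric range.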
Physical description
Bibliography: 40 items.
  • Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100094, China
  • University of Chinese Academy of Sciences, Beijing 100049, China
  • Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100094, China