Wykrywanie duplikatów w danych XML

Piłka, T.; Pankowski, T.

Article - details

Article title

Wykrywanie duplikatów w danych XML

Authors

Piłka T., Pankowski T.

Identifiers

Title variants

Detection duplicates in XML data

Publication languages

Abstracts

In recent years XML has become one of the leading formats for publishing data on the Web. Modern data integration systems require not only mechanisms for data exchange and transformation, but also expect the resulting data to contain no redundant elements (duplicates). Duplicates arise when data from heterogeneous sources are integrated. The same real-world objects may then be described differently in different sources; moreover, these descriptions are often not identical but only similar with respect to certain similarity relations. Methods for detecting duplicates have so far been developed for relational databases, but for XML data the problem is much harder because of the irregular structure of elements and their hierarchical organization. In this paper we discuss methods for identifying duplicates in XML documents in which XML trees are represented as sets of paths, on which comparison operations are performed. A thresholded similarity function is used to mark two elements as duplicates.

Present-day data integration systems require not only the ability to perform data exchange and data transformation, but also the delivery of complete results free of unnecessary elements (duplicates). Duplicates appear when data from distributed sources are combined, because the same real-world object may be represented differently in different data sources. Hence there is a need for XML data cleansing, which requires methods for detecting duplicates in XML data. Related work concerns the relational data model. For XML the problem is more difficult, because XML data is organized hierarchically and has an irregular structure. In this paper we discuss methods of duplicate detection in XML documents in which XML trees are represented as sets of their paths. To mark two elements as duplicates, we use thresholded similarity measures.
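
The idea summarized in the abstracts (XML trees as sets of paths, compared with a thresholded similarity) can be illustrated with a minimal sketch. It is not the authors' algorithm: the Jaccard coefficient, the threshold of 0.5, the lower-casing of text values, the helper names `path_set`, `jaccard` and `is_duplicate`, and the sample `<person>` data are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def path_set(element, prefix=""):
    """Collect (path, text) pairs for every text-bearing node under `element`."""
    path = f"{prefix}/{element.tag}"
    pairs = set()
    text = (element.text or "").strip()
    if text:
        # Lower-casing is an assumed normalization step, not taken from the paper.
        pairs.add((path, text.lower()))
    for child in element:
        pairs |= path_set(child, path)
    return pairs


def jaccard(a, b):
    """Jaccard coefficient of two sets (assumed stand-in for the similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(x, y, threshold=0.5):
    """Mark two elements as duplicates when their path-set similarity reaches the threshold."""
    return jaccard(path_set(x), path_set(y)) >= threshold


# Hypothetical sample data: the first two records describe the same person.
doc = ET.fromstring("""
<people>
  <person><name>Jan Kowalski</name><city>Poznan</city><phone>123</phone></person>
  <person><name>Jan Kowalski</name><city>Poznan</city></person>
  <person><name>Anna Nowak</name><city>Gdansk</city></person>
</people>
""")
first, second, third = list(doc)

print(is_duplicate(first, second))  # the two records share 2 of 3 distinct pairs
print(is_duplicate(first, third))   # the two records share no pairs
```

Exact matching of path-value pairs is the simplest possible choice; an approximate string measure on the text values would tolerate spelling variants, at the cost of comparing pairs rather than intersecting sets.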

Keywords

duplication in XML data; duplicate detection

duplicates in XML data; duplicate detection

Publisher

Wydział Elektroniki, Telekomunikacji i Informatyki Politechniki Gdańskiej

Journal

Zeszyty Naukowe Wydziału ETI Politechniki Gdańskiej. Technologie Informacyjne

Year

2008

Volume

Vol. 16

Pages

237-242

Physical description

Bibliography: 13 items, figures.

Contributors

author

Piłka T.

author

Pankowski T.

Uniwersytet im. A. Mickiewicza w Poznaniu, Wydział Matematyki i Informatyki

Bibliography

[1] Pankowski T., Cybulka J., Meissner A.: XML Schema Mappings in the Presence of Key Constraints and Value Dependencies, ICDT 2007, Workshop EROW'07, 2007.
[2] Weis M., Naumann F.: Detecting Duplicate Objects in XML Documents, IQIS 2004, International Workshop on Information Quality in Information Systems, 2004.
[3] Weis M.: Fuzzy Duplicate Detection on XML Data, VLDB 2005 PhD Workshop, Trondheim, Norway, 2005.
[4] Winkler W. E.: Advanced methods for record linkage, Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC, 1994.
[5] Rafiei D., Moise D. L., Sun D.: Finding Syntactic Similarities Between XML Documents, XANTEC 2006, International Workshop on XML Data Management Tools & Techniques, Kraków, 2006.
[6] Jaro M. A.: Probabilistic linkage of large public health data files. Statistics in Medicine 14:491-498, 1995.
[7] Quass D., Starkey P.: Record linkage for genealogical databases, KDD-2003, Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington, DC, 2003.
[8] Hernández M. A., Stolfo S. J.: The merge/purge problem for large databases. International Conference on Management of Data (SIGMOD), pages 127-138, San Jose, CA, May 1995.
[9] Puhlmann S., Weis M., Naumann F.: XML duplicate detection using sorted neighborhoods. International Conference on Extending Database Technology (EDBT), 2006.
[10] Ananthakrishna R., Chaudhuri S., Ganti V.: Eliminating fuzzy duplicates in data warehouses. International Conference on Very Large Databases (VLDB), Hong Kong, China, 2002.
[11] Nierman A., Jagadish H. V.: Evaluating structural similarity in XML documents, WebDB Workshop, Madison, June 2002.
[12] Zhang K., Shasha D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245-1262, 1989.
[13] Pankowski T.: Reasoning About Data in XML Data Integration, IPMU 2006, Information Processing and Management of Uncertainty in Knowledge-based Systems - IPMU 2006, Vol. 3, Editions EDK, 2506-2513, Paris, 2006.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BPG4-0036-0002