Wyniki wyszukiwania - BazTech

Ograniczanie wyników

1 Przegląd Elektrotechniczny

1 2012

Znaleziono wyników: 1

Liczba wyników na stronie

Wyniki wyszukiwania

Wyszukiwano:
w słowach kluczowych: approximateduplicate data

Sortuj według:

Ogranicz wyniki do:

Guo L., Wang W., Chen F., Tang X., Wang W.

Przegląd Elektrotechniczny

2012

R. 88, nr 1b

26-30

The changing information technology makes data increase exponentially in all areas, the quality of the huge amounts of data is the core problems. Data cleaning is an effective technology to solve data quality problems. This paper focuses on the duplicate data cleaning techniques. It studies the quality of the data from the architectural level, the instance-level problems, the multi-source single-source problems, duplicated records cleaning application platform and the evaluation criteria. In these studies, a improved novel detection method adopts the fuzzy clustering algorithm with the Levenshtein distance combination to data cleaning .It can accurately and quickly detect and remove duplicate raw data. The improved method includes a similar duplicate records detection process, the major system framework design, system function modules of the implementation process and results analysis in the paper. The precision and recall rates are higher than several other data cleaning methods. These comparisons confirm the validity of the method. The experimental results exhibit that the proposed method is effective in data detection and cleaning process.

Artykuł proponuje nowe metody czyszczenia danych z uwzględnieniem liczby przypadków, wielu źródeł, podwójnych rekordów i innych kryteriów oceny. Ulepszona metoda detekcji wykorzystuje algorytm rozmytego klastrowania w dystansem Levenshteina. W ten sposób szybko wykrywane są i usuwane podwójne wiersze danych.