Artificial intelligence applications in anomaly identification detection of big database

Thang, Phan Huy; Anh, Nguyen Thi Ngoc

doi:10.15439/2021R11

Article - details

Article title

Artificial intelligence applications in anomaly identification detection of big database

Authors

Thang Phan Huy, Anh Nguyen Thi Ngoc

Selected full texts from this journal

http://annals-csis.org

Identifiers

DOI

10.15439/2021R11

Title variants

Conference

Sixth International Conference on Research in Intelligent and Computing

Publication languages

Abstracts

Data matching is the process of finding, matching, and combining records that belong to the same entity, whether they come from several databases or from within a single one. Over the previous decade, every part of the data matching process has been improved by research in disciplines such as applied statistics, data mining, machine learning, database administration, and digital libraries. In particular, significant advances in artificial intelligence over the past decade have affected all aspects of the data identification process, especially how to improve the accuracy of data matching. First, this paper presents the data comparison process, detailing the steps of data pre-processing, comparison of the data fields of each record, classification, and quality assessment. Second, it introduces a method for extending the duplicate-object identification problem to big data. Third, it addresses specific aspects of matching time for unstructured data. Moreover, a methodology for solving big data matching problems with machine learning is proposed. Finally, the proposed method is applied to database cleanup and the detection of identifier abnormalities at the National Credit Information Center (CIC), yielding correct results in 96% to 98% of cases. The results are not only of theoretical interest but also of practical use in business operations at CIC.
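The four steps named in the abstract (pre-processing, field comparison, classification, quality assessment) can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract proposes machine learning, whereas the sketch substitutes a fixed similarity threshold as a stand-in classifier. The field names, the equal field weights, the 0.85 threshold, and the sample records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(value: str) -> str:
    """Pre-processing: lowercase, trim, and collapse repeated whitespace."""
    return " ".join(value.lower().split())


def field_similarity(a: str, b: str) -> float:
    """Field comparison: similarity ratio in [0, 1] computed by difflib."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def record_similarity(r1: dict, r2: dict, fields: list) -> float:
    """Average the per-field similarities (equal weights are an assumption)."""
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)


def classify_pairs(records: dict, fields: list, threshold: float = 0.85) -> set:
    """Classification: a pair is a match when its score reaches the threshold.

    A fixed threshold stands in for the learned classifier; comparing all
    pairs is quadratic and would not scale to big data without blocking.
    """
    matches = set()
    for id1, id2 in combinations(sorted(records), 2):
        if record_similarity(records[id1], records[id2], fields) >= threshold:
            matches.add((id1, id2))
    return matches


def precision_recall(predicted: set, truth: set) -> tuple:
    """Quality assessment against a hand-labelled set of true matches."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical records; names and identifier values are made up.
records = {
    1: {"name": "Nguyen Van An", "id_number": "012345678"},
    2: {"name": "nguyen  van an", "id_number": "012345678"},
    3: {"name": "Tran Thi Binh", "id_number": "987654321"},
}
predicted = classify_pairs(records, ["name", "id_number"])
precision, recall = precision_recall(predicted, {(1, 2)})
```

Here records 1 and 2 differ only in letter case and spacing, so normalization makes them identical and they are the only pair classified as a match.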

Keywords

big data; abnormality detection; duplicate profiles; similarity; artificial intelligence

Publisher

Polskie Towarzystwo Informatyczne

Journal

Annals of Computer Science and Information Systems

Year

2021

Volume

Vol. 27

Pages

87-92

Physical description

Bibliography: 10 items, figures, tables.

Contributors

author

Thang Phan Huy

thangph@creditinfo.org.vn

National Credit Information Center (CIC), Hanoi, Vietnam

author

Anh Nguyen Thi Ngoc

anh.nguyenthingoc@hust.edu.vn

School of Applied Mathematics and Informatics, Hanoi University of Science and Technology; CMC Institute of Science and Technology, Hanoi, Vietnam

https://orcid.org/0000-0002-6555-9740

Bibliography

1. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios, Duplicate record detection: A survey, In: IEEE Transactions on Knowledge and Data Engineering, 2007, Vol.19.
2. G. Ranganathan, V. Bindhu, Jenifer Raj, Duplicate record detection using intelligent approaches, In: International Journal of Pure and Applied Mathematics, 2018, Vol.119, No.12, pp.13077-13087.
3. Peter Christen, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer (2012).
4. Batini, C., Scannapieco, M.: Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications. Springer (2006).
5. Arasu, A., Götz, M., Kaushik et al.: On active learning of record matching packages. In: ACM SIGMOD, pp.783-794. Indianapolis (2010).
6. Alvarez, R., Jonas, J., Winkler, W., Wright, R.: Interstate voter registration database matching: the Oregon-Washington 2008 pilot project. In: Workshop on Trustworthy Elections, pp.17-17. USENIX Association (2009).
7. Roya Hassanian-esfahani, Mohammad-javad Kargar, Sectional MinHash for near-duplicate detection, In: Expert Systems with Applications, Volume 99, 1 June 2018, pp.203-212.
8. Arfa Skandar, Mariam Rehman, Maria Anjum, An Efficient Duplication Record Detection Algorithm for Data Cleansing, In: International Journal of Computer Applications, Volume 127, October 2015, pp.28-37.
9. Djulaga Hadzic, Nermin Sarajlic, Methodology for fuzzy duplicate record identification based on the semantic-syntactic information of similarity, In: Journal of King Saud University - Computer and Information Sciences, Volume 32, 2020, pp.126-136.
10. Toan Nguyen Mau, Van-Nam Huynh, An LSH-based k-representatives clustering method for large categorical data, In: Neurocomputing, Volume 463, 2021, pp.29-44.

Notes

Record prepared with funds from MEiN, agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) - module: Popularisation of science and promotion of sport (2022-2023).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-d56d5b0f-0f5f-451c-87c3-e77906e547bb