Unsupervised machine learning in financial anomaly detection: clustering algorithms vs. dedicated methods

Woźniak, Radosław J.

doi:10.5604/01.3001.0054.8748

Article details

Article title

Unsupervised machine learning in financial anomaly detection: clustering algorithms vs. dedicated methods

Authors

Woźniak Radosław J.


Identifiers

DOI

10.5604/01.3001.0054.8748

Title variants

Nienadzorowane uczenie maszynowe w wykrywaniu anomalii finansowych : algorytmy klasteryzacji a metody dedykowane

Publication languages

Abstracts

The article compares selected clustering algorithms with several dedicated anomaly detection algorithms on the task of detecting anomalies in financial data. To make clustering algorithms usable for anomaly detection, the Determine Abnormal Clusters Algorithm (DACA) was developed and implemented. DACA is a parameterized script that automatically identifies clusters containing anomalies on the basis of defined distance measures, which allows clustering algorithms to be adapted to anomaly detection quickly and efficiently. The prepared test environment made it possible to compare the selected clustering algorithms (K-Means, Hierarchical Cluster Analysis, K-Medoids) with the dedicated anomaly detection algorithms (Stochastic Outlier Selection, Isolation Forest, Elliptic Envelope). The research was carried out on real financial data, specifically the income declared in the asset declarations of a selected professional group. The experience of financial experts was used to assess the anomalies. In addition, the results were evaluated with a number of popular classification and clustering measures. For the investigated financial problem, the best results were obtained by the K-Medoids algorithm combined with the DACA script. Future research on the introduced solutions as an ensemble method is worthwhile.
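The idea of adapting a clustering algorithm to anomaly detection can be illustrated with a minimal sketch. The abstract does not specify DACA's actual rules, so everything below is an assumption made for illustration: the function names `kmeans` and `flag_abnormal_clusters`, the parameters `distance_factor` and `max_share`, the flagging rule (a cluster is abnormal when it is both small and far from the largest cluster), and the toy data. A plain k-means stands in for the clustering step, although the article reports the best results with K-Medoids.

```python
from math import dist
from statistics import fmean


def kmeans(points, k, iterations=50):
    """Plain Lloyd-style k-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            new_centroids.append(
                tuple(fmean(axis) for axis in zip(*members)) if members else centroids[j]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def flag_abnormal_clusters(points, labels, centroids, distance_factor=3.0, max_share=0.2):
    """Hypothetical rule (not the published DACA): flag clusters that are
    small and whose centroid lies far from the largest cluster's centroid."""
    k = len(centroids)
    sizes = [labels.count(j) for j in range(k)]
    ref = max(range(k), key=lambda j: sizes[j])  # largest cluster = "normal" reference
    ref_members = [p for p, lab in zip(points, labels) if lab == ref]
    # Average distance of reference members to their centroid (cluster radius).
    radius = fmean(dist(p, centroids[ref]) for p in ref_members)
    abnormal = set()
    for j in range(k):
        if j == ref or sizes[j] == 0:
            continue
        far = dist(centroids[j], centroids[ref]) > distance_factor * radius
        small = sizes[j] / len(points) <= max_share
        if far and small:
            abnormal.add(j)
    return abnormal


# Toy data (invented): (declared income, second feature), two ordinary
# groups and two records with unusually high income.
points = [
    (50, 10), (52, 11), (48, 9), (51, 10), (49, 12), (53, 10), (47, 11), (50, 9),
    (90, 30), (92, 31), (88, 29), (91, 30), (89, 32), (93, 28),
    (400, 5), (410, 6),
]
labels, centroids = kmeans(points, k=3)
abnormal = flag_abnormal_clusters(points, labels, centroids)
anomalies = [i for i, lab in enumerate(labels) if lab in abnormal]
print(anomalies)
```

On this toy data the sketch flags only the two high-income records: the second ordinary group is also far from the reference cluster, but it is too large to satisfy the `max_share` condition. In practice both thresholds would need tuning against expert-labeled cases, as the article does with the assessments of financial experts.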


Keywords

anomaly detection, classification, clustering, financial fraud, finance

anomaly detection, clustering, machine learning, financial fraud, finance

Publisher

Instytut Teleinformatyki i Automatyki, Wydział Cybernetyki, Wojskowa Akademia Techniczna im. Jarosława Dąbrowskiego

Journal

Przegląd Teleinformatyczny (Teleinformatics Review)

Year

2023

Volume

Vol. 11, No. 1-4 (29)

Pages

29-46

Physical description

Bibliography: 25 items, tables, charts.

Contributors

author

Woźniak Radosław J.

radoslaw.wozniak@wat.edu.pl

Institute of Information Systems, Faculty of Cybernetics, MUT, Kaliskiego 2, 00-908 Warsaw, Poland

https://orcid.org/0009-0003-8213-7472


Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-475b162f-c6fa-46c8-9642-ad39c45f522a