Article title

A comparative study for outlier detection methods in high dimensional text data

Authors
Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
Outlier detection aims to find data samples that differ significantly from the rest of the data. Various outlier detection methods have been proposed and shown to detect anomalies in many practical problems. In high dimensional data, however, conventional outlier detection methods often behave unexpectedly due to a phenomenon called the curse of dimensionality. In this paper, we compare and analyze outlier detection performance in various experimental settings, focusing on text data with dimensions typically in the tens of thousands. Experiments were set up to compare the performance of outlier detection methods in unsupervised versus semi-supervised mode and on uni-modal versus multi-modal data distributions. The performance of outlier detection methods based on dimension reduction is also compared, and a discussion on using k-NN distance in high dimensional data is provided. This experimental comparison across various environments can provide insights into the application of outlier detection methods to high dimensional data.
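The k-NN distance score mentioned in the abstract can be sketched in a few lines of NumPy. This is an illustrative example only, not the paper's experimental setup: the function name, the choice of k, and the small synthetic data set are assumptions made for the demo.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each row of X by the distance to its k-th nearest neighbor.

    Larger scores indicate more outlying points. Brute-force O(n^2 d)
    pairwise distances; fine for a small demo, not for large corpora.
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)       # exclude self-distance
    # Sort each row's distances; column k-1 is the k-th nearest neighbor.
    return np.sort(dist, axis=1)[:, k - 1]

# Tiny demo: 20 clustered points in 50 dimensions plus one distant outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 50)),
               np.full((1, 50), 10.0)])
scores = knn_outlier_scores(X, k=3)
print(scores.argmax())  # -> 20, the injected outlier
```

In genuinely high dimensional spaces (tens of thousands of dimensions, as in the text data studied here), pairwise distances tend to concentrate, which is exactly why the paper revisits how informative such k-NN scores remain.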
Year
Pages
5--17
Physical description
Bibliography: 42 items, figures.
Creators
  • Department of Computer Science and Engineering, Chungnam National University, 220 Gung-dong, Yuseong-gu, Daejeon 305-763, Korea
Bibliography
  • [1] D. Hawkins. Identification of Outliers. Chapman and Hall, 1980.
  • [2] C. Aggarwal. Outlier analysis (2nd ed.) Springer, 2017.
  • [3] Caroline Cynthia and Thomas George. An outlier detection approach on credit card fraud detection using machine learning: a comparative analysis on supervised and unsupervised learning. In: Peter J., Fernandes S., Alavi A. (eds) Intelligence in Big Data Technologies - Beyond the Hype. Advances in Intelligent Systems and Computing, 1167, 2021.
  • [4] H. Mazzawi, G. Dalai, D. Rozenblat, L. Ein-Dor, M. Ninio, O. Lavi, A. Adir, E. Aharoni, and E. Kermany. Anomaly detection in large databases using behavioral patterning. In ICDE, 2017.
  • [5] T. Li, J. Ma, and C. Sun. Dlog: diagnosing router events with syslogs for anomaly detection. The Journal of Supercomputing, 74(2):845–867, 2018.
  • [6] C. Park. Outlier and anomaly pattern detection on data streams. The Journal of Supercomputing, 75:6118–6128, 2019.
  • [7] H. Wang, M. Bah, and M. Hammad. Progress in outlier detection techniques: A survey. IEEE Access, 7, 2019.
  • [8] A. Boukerche, L. Zheng, and O. Alfandi. Outlier detection: Methods, models, and classification. ACM Computing Surveys, 53:1–37, 2020.
  • [9] X. Zhao, J. Zhang, and X. Qin. Loma: A local outlier mining algorithm based on attribute relevance analysis. Expert Systems with Applications, 84, 2017.
  • [10] X. Zhao, J. Zhang, X. Qin, J. Cai, and Y. Ma. Parallel mining of contextual outlier using sparse subspace. Expert Systems with Applications, 126, 2019.
  • [11] F. Kamalov and H. Leung. Outlier detection in high dimensional data. Journal of Information and Knowledge Management, 19, 2020.
  • [12] C. Park. A dimension reduction method for unsupervised outlier detection in high dimensional data (written in Korean). Journal of KIISE. In press.
  • [13] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of ACM SIGMOD, pages 427–438, 2000.
  • [14] E. Knorr and R. Ng. Finding intensional knowledge of distance-based outliers. In Proceedings of the 25th International Conference on Very Large Databases, 1999.
  • [15] M. Sugiyama and K. Borgwardt. Rapid distance-based outlier detection via sampling. In International Conference on Neural Information Processing Systems, 2013.
  • [16] A. Zimek, E. Schubert, and H. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5:363–387, 2012.
  • [17] H. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of KDD, pages 444–452, 2008.
  • [18] M. Goldstein and A. Dengel. Histogram-based outlier score (hbos): a fast unsupervised anomaly detection algorithm. In Proceeding of KI, pages 59–63, 2012.
  • [19] B. Scholkopf, J. Platt, J. Shawe-Taylor, and A. Smola. Estimating the support of a high-dimensional distribution. Neural Computation, pages 1443–1471, 2001.
  • [20] M. Amer, M. Goldstein, and S. Abdennadher. Enhancing one-class support vector machines for unsupervised anomaly detection. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, 2013.
  • [21] L. Ruff, R. Vandermeulen, N. Gornitz, L. Deecke, S. Siddiqui, A. Binder, E. Muller, and M. Kloft. Deep one-class classification. In Proceedings of the International Conference on Machine Learning, 2018.
  • [22] M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000.
  • [23] P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, Boston, 2006.
  • [24] F. Liu, K. Ting, and Z. Zhou. Isolation forest. In Proceedings of the 8th international conference on data mining, 2008.
  • [25] G. Susto, A. Beghi, and S. McLoone. Anomaly detection through on-line isolation forest: An application to plasma etching. In the 28th Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC), pages 89–94, 2017.
  • [26] L. Puggini and S. McLoone. An enhanced variable selection and isolation forest based methodology for anomaly detection with OES data. Engineering Applications of Artificial Intelligence, 67:126–135, 2018.
  • [27] J. Kim, H. Naganathan, S. Moon, W. Chong, and S. Ariaratnam. Applications of clustering and isolation forest techniques in real-time building energy-consumption data: application to LEED certified buildings. Journal of Energy Engineering, 143, 2017.
  • [28] J. Hofmockel and E. Sax. Isolation forest for anomaly detection in raw vehicle sensor data. In the 4th International Conference on Vehicle Technology and Intelligent Transport Systems (VEHITS 2018), pages 411–416, 2018.
  • [29] J. Livesey. Kurtosis provides a good omnibus test for outliers in small samples. Clinical Biochemistry, 40:1032–1036, 2007.
  • [30] F. Liu, K. Ting, and Z. Zhou. On detecting clustered anomalies using sciforest. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2010.
  • [31] S. Hariri, M. Kind, and R. Brunner. Extended isolation forest. IEEE transactions on knowledge and data engineering, 33:1479–1489, 2021.
  • [32] H. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In Proceedings of PAKDD, 2009.
  • [33] A. Lazarevic and V. Kumar. Feature bagging for outlier detection. In Proceedings of KDD, 2005.
  • [34] R. Duda, P. Hart, and D. Stork. Pattern classification (2nd ed.). Wiley-interscience, 2000.
  • [35] M. Shyu, S. Chen, K. Sarinnapakorn, and L. Chang. A novel anomaly detection scheme based on principal component classifier. In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, 2003.
  • [36] P. Westfall. Kurtosis as peakedness, 1905-2014. R.I.P. The American Statistician, 68(3):191–195, 2014.
  • [37] D. Pena and F. Prieto. Multivariate outlier detection and robust covariance matrix estimation. Technometrics, 43:286–310, 2001.
  • [38] D. Greene and P. Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceeding of ICML, 2006.
  • [39] Y. Zhao, Z. Nasrullah, and Z. Li. PyOD: a Python toolbox for scalable outlier detection. Journal of Machine Learning Research, 20:1–7, 2019.
  • [40] A. Paszke, S. Gross, F. Massa, A. Lerer, et al. PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, pages 8026–8037, 2019.
  • [41] L. Abdallah, M. Badarna, W. Khalifa, and M. Yousef. Multikoc: Multi-one-class classifier based k-means clustering. Algorithms, 14(5):1–10, 2021.
  • [42] B. Krawczyk, M. Wozniak, and B. Cyganek. Clustering-based ensemble for one-class classification. Information Sciences, 264:182–195, 2014.
Notes
Record created with funds from the Ministry of Education and Science (MEiN), agreement no. SONP/SP/546092/2022, under the "Social Responsibility of Science" programme - module: popularisation of science and promotion of sport (2022-2023).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-0b04b8ab-e79b-4715-8a25-75859a32ae1d