Journal
2010 | Vol. 98, no. 1 | 71-86
Abstract
Outlier detection in high-dimensional data sets is a challenging data mining task. Mining outliers in subspaces is a promising solution, because outliers may be embedded in some interesting subspaces. However, searching all possible subspaces runs into the problem known as "the curse of dimensionality". Because high-dimensional data sets contain many irrelevant dimensions, it is essential to eliminate the irrelevant or unimportant dimensions and identify interesting subspaces with strong correlation. Normally, the correlation among dimensions can be determined by traditional feature selection techniques or subspace-based clustering methods. Dimension-growth subspace clustering techniques can find interesting subspaces in relatively low-dimensional spaces, while dimension-reduction approaches try to group interesting subspaces with larger dimensions. This paper investigates the possibility of detecting outliers in correlated subspaces and presents a novel approach for doing so. The degree of correlation among dimensions is measured in terms of the mean squared residue, and a dimension-reduction method is employed to find the correlated subspaces. Based on the correlated subspaces obtained, we introduce another criterion, called the "shape factor", to rank the most important subspaces among the projected subspaces. Finally, outliers are identified within the most important subspaces by using classical outlier detection techniques. Empirical studies show that the proposed approach can identify outliers effectively in high-dimensional data sets.
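The abstract names the mean squared residue as its correlation measure but does not state the formula. The sketch below assumes the usual definition from the biclustering literature (Cheng and Church [10]): for a submatrix with rows I and columns J, each entry's residue is a_ij - a_iJ - a_Ij + a_IJ, and the score is the mean of the squared residues. The function name `mean_squared_residue` and the sample data are illustrative only and do not come from the paper.

```python
from statistics import fmean


def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by rows x cols.

    Assumes the Cheng-Church definition; a score near 0 means the selected
    columns vary coherently (additively) across the selected rows.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    row_means = [fmean(r) for r in sub]          # a_iJ
    col_means = [fmean(c) for c in zip(*sub)]    # a_Ij
    overall = fmean(row_means)                   # a_IJ (rows have equal length)
    total = 0.0
    for i, r in enumerate(sub):
        for j, value in enumerate(r):
            residue = value - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (len(rows) * len(cols))


# Hypothetical data: rows differ only by a constant shift, so the
# dimensions are perfectly coherent and the residue is (numerically) zero.
coherent = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
# Hypothetical data with an anti-correlated pattern.
incoherent = [[1, 2], [2, 1]]

score_coherent = mean_squared_residue(coherent, [0, 1, 2], [0, 1, 2])
score_incoherent = mean_squared_residue(incoherent, [0, 1], [0, 1])
```

Under this definition a lower score indicates a more strongly correlated subspace, which is why such a measure can serve as a filter before any outlier detector is applied.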
Pages
71-86
Physical description
Bibliography: 26 items, tables, charts.
Bibliography
- [1] Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., Park, J. S.: Fast Algorithms for Projected Clustering, SIGMOD Rec., 28(2), 1999, 61-72.
- [2] Aggarwal, C. C., Yu, P. S.: Outlier Detection for High Dimensional Data, Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, 2001.
- [3] Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Proceedings of the 1998 ACM SIGMOD International Conference On Management of Data, 1998.
- [4] Asuncion, A., Newman, D.: UCI Machine Learning Repository, 2007.
- [5] Barkow, S., Bleuler, S., Prelic, A., Zimmermann, P., Zitzler, E.: BicAT: A Biclustering Analysis Toolbox, Bioinformatics, 22(10), 2006, 1282-1283.
- [6] Barnett, V., Lewis, T.: Outliers in Statistical Data, John Wiley & Sons, 1994.
- [7] Bellman, R.: Dynamic Programming, Princeton University Press, Princeton, NJ, 1957.
- [8] Breunig, M. M., Kriegel, H.-P., Ng, R. T., Sander, J.: LOF: Identifying Density-based Local Outliers, Proceedings of the 2000 ACM SIGMOD International Conference On Management of Data, 2000.
- [9] Cheng, C. H., Fu, A. W.-C., Zhang, Y.: Entropy-based Subspace Clustering for Mining Numerical Data, Proceedings of the 1999 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
- [10] Cheng, Y., Church, G.: Biclustering of Expression Data, Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), 2000.
- [11] DuMouchel, W., Schonlau, M.: A Fast Computer Intrusion Detection Algorithm based on Hypothesis Testing of Command Transition Probabilities, Proceedings of Knowledge Discovery and Data Mining, 1998.
- [12] Fawcett, T., Provost, F. J.: Adaptive Fraud Detection, Data Mining and Knowledge Discovery Journal, 1(3), 1997, 291-316.
- [13] Glymour, C., Madigan, D., Pregibon, D., Smyth, P.: Statistical Themes and Lessons for Data Mining, Data Mining and Knowledge Discovery, 1(1), 1997, 11-28.
- [14] Hartigan, J. A.: Direct Clustering of a Data Matrix, Journal of the American Statistical Association, 67(337), 1972, 123-129.
- [15] Hawkins, D.: Identification of Outliers, Chapman and Hall, London, 1980.
- [16] Knorr, E. M., Ng, R. T.: Algorithms for Mining Distance-based Outliers in Large Datasets, Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
- [17] Leng, J., Li, J., Fu, A. W.-C.: Exploring Most Interesting Subspaces for Effective Top N Outlier Detection. Submitted for publication.
- [18] Madeira, S., Oliveira, A.: Biclustering Algorithms for Biological Data Analysis: A Survey, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1), 2004, 24-45.
- [19] Mirkin, B.: Mathematical Classification and Clustering, Dordrecht: Kluwer, 1996.
- [20] Parsons, L., Haque, E., Liu, H.: Subspace Clustering for High Dimensional Data: A Review, SIGKDD Explorations Newsletter, 6(1), 2004, 90-105.
- [21] Procopiuc, C. M., Jones, M., Agarwal, P. K., Murali, T. M.: A Monte Carlo algorithm for fast projective clustering, Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, 2002.
- [22] Ramaswamy, S., Rastogi, R., Shim, K.: Efficient Algorithms for Mining Outliers from Large Data Sets, SIGMOD Record, 29(2), 2000, 427-438.
- [23] Rodgers, J. L., Nicewander, W. A.: Thirteen Ways to Look at the Correlation Coefficient, The American Statistician, 42(1), 1988, 59-66.
- [24] Rozman, I.: Improving Mining of Medical Data by Outliers Prediction, Proceedings of 18th IEEE Symposium on Computer-Based Medical Systems, 2005.
- [25] Yang, J., Wang, W., Wang, H., Yu, P. S.: δ-cluster: Capturing Subspace Correlation in a Large Data Set, Proceedings of 18th IEEE International Conference on Data Engineering, 2002.
- [26] Zhang, J., Wang, H.: Detecting Outlying Subspaces for High-dimensional Data: the new task, algorithms, and performance, Knowledge and Information Systems, 10(3), 2006, 333-355.
Document type
Bibliography
Identifiers
YADDA identifier
bwmeta1.element.baztech-article-BUS8-0010-0005