Subpopulation Discovery in Epidemiological Data with Subspace Clustering

Niemann, U.; Spiliopoulou, M.; Völzke, H.; Kühn, J.-P.

doi:10.2478/fcds-2014-0015

Abstract

A prerequisite of personalized medicine is the identification of groups of people who share specific risk factors for an outcome. We investigate the potential of subspace clustering for finding such groups in epidemiological data. We propose a workflow that encompasses clusterability assessment before cluster discovery and quality assessment after learning the clusters. Epidemiological data usually do not have a ground truth for the verification of clusters found in subspaces. Hence, we introduce quality assessment through juxtaposition of the learned models to “models-of-randomness”, i.e. models that do not reflect a true cluster structure. On the basis of this workflow, we select subspace clustering methods and compare and discuss their performance. We use a dataset with hepatic steatosis as the outcome, but our findings apply to arbitrary epidemiological cohort data that have tens of variables and exhibit class skew.

Publisher

Wydawnictwo Politechniki Poznańskiej

Journal

Foundations of Computing and Decision Sciences

Year

2014

Volume

Vol. 39, No. 4

Pages

271-300

Physical description

Bibliography: 46 items, figures, tables.

Contributors

author

Niemann U

Otto-von-Guericke University Magdeburg, Germany

author

Spiliopoulou M.

Otto-von-Guericke University Magdeburg, Germany

author

Völzke H

University Medicine Greifswald, Germany

author

Kühn J-P

University Medicine Greifswald, Germany

Bibliography

[1] B. Preim, P. Klemm, H. Hauser, K. Hegenscheid, S. Oeltze, K. Toennies, and H. Völzke, Visualization in Medicine and Life Sciences III, ch. Visual Analytics of Image-Centric Cohort Studies in Epidemiology. Springer, 2014.
[2] A. D. Hingorani, D. A. van der Windt, R. D. Riley, (...), W. Sauerbrei, D. G. Altman, and H. Hemingway, “Prognosis research strategy (PROGRESS) 4: Stratified medicine research,” BMJ: British Medical Journal, vol. 346, no. e5793, 2013.
[3] H. Völzke, C. Schmidt, K. Hegenscheid, J. Kühn, F. Bamberg, W. Lieb, H. Kroemer, N. Hosten, and R. Puls, “Population imaging as valuable tool for personalized medicine,” Clin Pharmacol Ther, vol. 92, no. 4, pp. 422-424, 2012.
[4] H. Völzke, D. Alte, . . . , R. Biffar, U. John, and W. Hoffmann, “Cohort profile: the Study of Health In Pomerania,” International Journal of Epidemiology, vol. 40, no. 2, pp. 294-307, 2011.
[5] L. Parsons, E. Haque, and H. Liu, “Subspace Clustering for High Dimensional Data: A Review,” ACM SIGKDD Explorations Newsletter, vol. 6, pp. 90-105, 2004.
[6] K. Sim, V. Gopalkrishnan, A. Zimek, and G. Cong, “A survey on enhanced subspace clustering,” Data mining and knowledge discovery, vol. 26, pp. 332-397, 2013.
[7] A. Zimek, Data Clustering: Algorithms and Applications, ch. Clustering High-Dimensional Data, pp. 201-230. CRC Press, 2013.
[8] C. Zhang and R. L. Kodell, “Subpopulation-specific confidence designation for more informative biomedical classification,” Artificial Intelligence in Medicine, vol. 58, no. 3, pp. 155-163, 2013.
[9] S. Glaßer, U. Niemann, B. Preim, and M. Spiliopoulou, “Can we Distinguish Between Benign and Malignant Breast Tumors in DCE-MRI by Studying a Tumor's Most Suspect Region Only?,” in 26th International Symposium on Computer-Based Medical Systems (CBMS), pp. 77-82, 2013.
[10] U. Niemann, H. Völzke, J.-P. Kühn, and M. Spiliopoulou, “Learning and inspecting classification rules from longitudinal epidemiological data to identify predictive features on hepatic steatosis,” Expert Systems with Applications, vol. 41, pp. 5405-5415, September 2014.
[11] T. Hielscher, M. Spiliopoulou, H. Völzke, and J.-P. Kühn, “Using participant similarity for the classification of epidemiological data on hepatic steatosis,” in Proc. of the 27th IEEE Int. Symposium on Computer-Based Medical Systems (CBMS'14), pp. 1-7, IEEE, 2014.
[12] M. A. Hall, “Correlation-based feature selection for discrete and numeric class machine learning,” in Proc. of 17th Int. Conf. on Machine Learning, pp. 359-366, Morgan Kaufmann, 2000.
[13] P. Klemm, L. Frauenstein, D. Perlich, K. Hegenscheid, H. Völzke, and B. Preim, “Clustering Socio-demographic and Medical Attribute Data in Cohort Studies,” in Bildverarbeitung für die Medizin (BVM), pp. 180-185, Springer Berlin Heidelberg, 2014.
[14] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” in Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 94-105, 1998.
[15] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park, “Fast Algorithms for Projected Clustering,” in Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 61-72, 1999.
[16] D. Damian, M. Orešič, E. Verheij, J. Meulman, J. Friedman, A. Adourian, N. Morel, A. Smilde, and J. van der Greef, “Applications of a new subspace clustering algorithm (COSA) in medical systems biology,” Metabolomics, vol. 3, no. 1, pp. 69-77, 2007.
[17] L. S. Friedman and E. B. Keeffe, Handbook of Liver Disease. Library of Congress Cataloging-in-Publication Data, 2011.
[18] A. P. Levene and R. D. Goldin, “The epidemiology, pathogenesis and histopathology of fatty liver disease,” Histopathology, vol. 61, pp. 141-152, 2012.
[19] S. Bellentani, G. Bedogni, L. Miglioli, and C. Tiribelli, “The epidemiology of fatty liver,” European Journal of Gastroenterology & Hepatology, vol. 16, pp. 1087-1093, 2004.
[20] G. Bedogni, S. Bellentani, L. Miglioli, F. Masutti, M. Passalacqua, A. Castiglione, and C. Tiribelli, “The Fatty Liver Index: a simple and accurate predictor of hepatic steatosis in the general population,” BMC Gastroenterology, vol. 6, no. 33, 2006.
[21] X. Yuan, D. Waterworth, J. R. Perry, (...), T. M. Frayling, J. S. Kooner, and V. Mooser, “Population-based genome-wide association studies reveal six loci influencing plasma levels of liver enzymes,” The American Journal of Human Genetics, vol. 83, no. 4, pp. 520-528, 2008.
[22] H. Völzke, S. Schwarz, S. E. Baumeister, H. Wallaschofski, C. Schwahn, H. J. Grabe, T. Kohlmann, U. John, and M. Dören, “Menopausal status and hepatic steatosis in a general female population,” Gut, vol. 56, pp. 594-595, 2007.
[23] S. Baumeister, H. Völzke, P. Marschall, U. John, C. Schmidt, and D. Alte, “Impact of fatty liver disease on health care utilization and costs in the general population: a 5-year observation,” Gastroenterology, vol. 134, pp. 85-94, 2008.
[24] J.-P. Kühn, D. Hernando, B. Mensel, (...), J. Mayerle, N. Hosten, and S. B. Reeder, “Quantitative chemical shift-encoded MRI is an accurate method to quantify hepatic steatosis,” Journal of Magnetic Resonance Imaging, vol. 39, no. 6, pp. 1494-1501, 2014.
[25] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques Third Edition. Morgan Kaufmann Publishers, 2012.
[26] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, 1996.
[27] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of eugenics, vol. 7, no. 2, pp. 179-188, 1936.
[28] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, “Classification of radar returns from the ionosphere using neural networks,” Johns Hopkins APL Tech. Dig, vol. 10, pp. 262-266, 1989.
[29] D. Dias, R. Madeo, T. Rocha, H. Biscaro, and S. Peres, “Hand movement recognition for brazilian sign language: A study using distance-based neural networks,” in International Joint Conference on Neural Networks (IJCNN 2009), pp. 697-704, 2009.
[30] K. Kailing, H.-P. Kriegel, and P. Kröger, “Density-Connected Subspace Clustering for High-Dimensional Data,” in Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246-257, 2004.
[31] I. Assent, R. Krieger, E. Müller, and T. Seidl, “DUSC: Dimensionality Unbiased Subspace Clustering,” in ICDM, pp. 409-414, 2007.
[32] U. Niemann, “The potential of high-dimensional clustering for subpopulation discovery in epidemiological datasets.” Otto-von-Guericke University Magdeburg, Faculty of Computer Science, 2014. Master Thesis.
[33] D. R. Wilson and T. R. Martinez, “Improved heterogeneous distance functions,” J. Artif. Int. Res., vol. 6, pp. 1-34, Jan. 1997.
[34] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Pearson/Addison-Wesley, 2006.
[35] P. J. Hanly and S. B. Ahmed, “Sleep Apnea and the Kidney: is sleep apnea a risk factor for chronic kidney disease?,” CHEST Journal, vol. 146, no. 4, pp. 1114-1122, 2014.
[36] J. Zhao, “Subspace clustering with gravitation,” in Grundlagen von Datenbanken, 2010.
[37] J. Zhao and S. Conrad, “Automatic subspace clustering with density function,” in DATA, pp. 63-69, 2012.
[38] E. Müller, I. Assent, S. Günnemann, R. Krieger, and T. Seidl, “Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data,” in Ninth IEEE International Conference on Data Mining (ICDM'09), pp. 377-386, IEEE, 2009.
[39] S. Günnemann, E. Müller, I. Färber, and T. Seidl, “Detection of orthogonal concepts in subspaces of high dimensional data,” in Proceedings of the 18th ACM conference on Information and knowledge management, pp. 1317-1326, ACM, 2009.
[40] G. Moise and J. Sander, “Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 533-541, ACM, 2008.
[41] U. Fayyad and K. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” in Proc. of the 13th Int. Joint Conf. on Artificial Intelligence (IJCAI), pp. 1022-1029, Morgan Kaufmann, 1993.
[42] M. J. Zaki, M. Peters, I. Assent, and T. Seidl, “Clicks: An effective algorithm for mining subspace clusters in categorical datasets,” Data & Knowledge Engineering, vol. 60, no. 1, pp. 51-70, 2007.
[43] G. Gan and J. Wu, “Subspace clustering for high dimensional categorical data,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 2, pp. 87-94, 2004.
[44] E. Müller, I. Assent, and T. Seidl, “HSM: Heterogeneous subspace mining in high dimensional data,” in Scientific and Statistical Database Management, pp. 497-516, Springer, 2009.
[45] F. Cao, J. Liang, D. Li, and X. Zhao, “A weighting k-modes algorithm for subspace clustering of categorical data,” Neurocomputing, vol. 108, pp. 23-30, 2013.
[46] I. Färber, S. Günnemann, H.-P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, and A. Zimek, “On using class-labels in evaluation of clusterings,” in MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD, 2010.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-6d8f977d-090c-46ab-aa66-fe93ee930aef