Article title

A sparse pair-preserving centroid-based supervised learning method for high-dimensional biomedical data or images

Publication languages
EN
Abstracts
EN
In biomedical applications that compare two groups (e.g. patients and controls in matched case-control studies), it is often desirable to reduce dimensionality in order to learn a classification rule from high-dimensional data. This paper considers a centroid-based classification method for paired data that simultaneously performs supervised variable selection respecting the matched-pairs design. We propose an algorithm for optimizing the centroid (prototype, template). A subsequent optimization of weights for the centroid ensures sparsity, robustness to outliers, and a clear interpretation of the contribution of individual variables to the classification task. We apply the method to a simulated matched case-control dataset, to a gene expression study of acute myocardial infarction, and to mouth localization in 2D facial images. The novel approach performs comparably to standard classifiers and outperforms them when the data are contaminated by outliers. This robustness makes the method relevant for high-dimensional genomic, metabolomic or proteomic data (in matched case-control studies) and for image-based medical diagnostics, as (excessive) noise and contamination are ubiquitous in biomedical measurements.
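The general idea of a sparse, pair-aware centroid classifier can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the centroid or weight optimization, so the sketch swaps in a simple soft-thresholding of paired t-like statistics to obtain sparse weights. The function names, the `threshold` parameter, and the weighted squared-distance rule are all assumptions made for illustration.

```python
import numpy as np

def fit_sparse_paired_centroid(cases, controls, threshold=2.0):
    """Illustrative only. cases and controls have shape (n_pairs, n_vars);
    row i of each array belongs to the same matched pair."""
    diffs = cases - controls                       # within-pair differences respect the matching
    mean_diff = diffs.mean(axis=0)
    std_err = diffs.std(axis=0, ddof=1) / np.sqrt(len(diffs)) + 1e-12
    t_stat = mean_diff / std_err                   # paired t-like statistic per variable
    # Soft-thresholding (an assumption, not the paper's weight optimization):
    # variables with a weak paired signal get weight exactly zero.
    weights = np.maximum(np.abs(t_stat) - threshold, 0.0)
    if weights.sum() > 0:
        weights = weights / weights.sum()
    return cases.mean(axis=0), controls.mean(axis=0), weights

def predict(x, centroid_case, centroid_control, weights):
    """Return 1 (case) where x is closer to the case centroid in the
    weighted squared distance, else 0 (control)."""
    d_case = np.sum(weights * (x - centroid_case) ** 2, axis=-1)
    d_control = np.sum(weights * (x - centroid_control) ** 2, axis=-1)
    return (d_case < d_control).astype(int)
```

Variables whose weight is zero drop out of the distance entirely, which is one simple way to obtain the kind of interpretability the abstract describes: only the retained variables contribute to the decision.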
Authors
author
  • The Czech Academy of Sciences, Institute of Computer Science, Pod Vodárenskou věží 2, 182 07 Praha 8, Czech Republic
  • The Czech Academy of Sciences, Institute of Computer Science, Prague, Czech Republic
Notes
Record prepared with funds from the Polish Ministry of Science and Higher Education (MNiSW), agreement No. 461252, under the "Social Responsibility of Science" programme, module: Popularisation of science and promotion of sport (2020).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-c39ede36-0d4c-49b7-b3c6-16c6a690f3da