The Monte Carlo feature selection and interdependency discovery is unbiased

Dramiński, M.; Kierczak, M.; Nowak-Brzezińska, A.; Koronacki, J.; Komorowski, J.

Abstract

We show that the Monte Carlo feature selection algorithm for supervised classification proposed by Dramiński et al. (2008) is not biased towards features with many categories (levels or values). The algorithm, later extended to discover interdependencies between features, is surprisingly simple and has been used successfully on many biological data sets and on transactional data of commercial origin, where it has never revealed any bias of this type. Its alleged unbiasedness nevertheless required closer scrutiny, which is provided here. Admittedly, the algorithm does reveal some bias coming from another source, but that bias is negligible. Hence our final claim is that the algorithm is practically unbiased and that the results it provides can be considered fully reliable.

Keywords

supervised classification; feature selection; feature interactions; high-dimensional problems; applications to genomic and proteomic data

Publisher

Systems Research Institute, Polish Academy of Sciences

Journal

Control and Cybernetics

Year

2011

Volume

Vol. 40, no. 2

Pages

199-211

Physical description

Bibliography: 18 items, charts.

Authors

author

Dramiński M.

author

Kierczak M.

author

Nowak-Brzezińska A.

author

Koronacki J.

author

Komorowski J.

Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland

Bibliography

Archer, K.J. and Kimes, R.V. (2008) Empirical Characterization of Random Forest Variable Importance Measures. Comp Stat & Data Anal, 52(4), 2249-2260.
Breiman, L. and Cutler, A. (2008) Random Forests - Classification/Clustering Manual. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Chrysostomou, K., Chen, S.Y. and Liu, X. (2008) Combining Multiple Classifiers for Wrapper Feature Selection. Int. J. Data Mining, Modelling and Management, 1, 91-102.
Diaz-Uriarte, R. and de Andres, S.A. (2006) Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics, 7(3), doi:10.1186/1471-2105-7-3.
Dramiński, M., Rada-Iglesias, A., Enroth, S., Wadelius, C., Koronacki, J. and Komorowski, J. (2008) Monte Carlo Feature Selection for Supervised Classification. Bioinformatics, 24(1), 110-117.
Dramiński, M., Kierczak, M., Koronacki, J. and Komorowski, J. (2010) Monte Carlo feature selection and interdependency discovery in supervised classification. In: J. Koronacki, Z.W. Ras, S.T. Wierzchon, J. Kacprzyk, eds., Advances in Machine Learning, vol. II, Springer, 371-385.
Dudoit, S. and Fridlyand, J. (2003) Classification in Microarray Experiments. In: T. Speed, ed., Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall/CRC, 93-158.
Hothorn, T., Hornik, K. and Zeileis, A. (2006) Unbiased Recursive Partitioning: A Conditional Inference Framework. J. Computational and Graphical Statistics, 15, 651-674.
Kierczak, M., Ginalski, K., Dramiński, M., Koronacki, J., Rudnicki, W. and Komorowski, J. (2009) A Rough Set-Based Model of HIV-1 Reverse Transcriptase Resistome. Bioinformatics and Biology Insights, 3, 109-127. http://www.la-press.com/a-rough-set-based-model-of-hiv-1-reversetranscriptase-resistome-a1685
Kierczak, M., Dramiński, M., Koronacki, J. and Komorowski, J. (2010) Computational analysis of molecular interaction networks underlying change of HIV-1 resistance to selected reverse transcriptase inhibitors. Bioinformatics and Biology Insights, 4, 137-146. http://www.la-press.com/computational-analysis-of-molecular-interactionnetworks-underlying-ch-article-a2395
Kierczak, M. (2009) From Physicochemical Properties to Interdependency Networks: A Monte Carlo Approach to Modeling HIV-1 Resistome and Post-translational Modifications. PhD Thesis, Uppsala University (for an introduction to the thesis see Publications at http://www.kierczak.pl)
Li, Y., Campbell, C. and Tipping, M. (2002) Bayesian Automatic Relevance Determination Algorithms for Classifying Gene Expression Data. Bioinformatics, 18(10), 1332-1339.
Lu, C., Devos, A., Suykens, J.A.K., Arus, C. and Van Huffel, S. (2007) Bagging Linear Sparse Bayesian Learning Models for Variable Selection in Cancer Diagnosis. IEEE Trans Inf Technol Biomed, 11, 338-347.
Saeys, Y., Inza, I. and Larrañaga, P. (2007) A Review of Feature Selection Techniques in Bioinformatics. Bioinformatics, 23(19), 2507-2517.
Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T. (2007) Bias in Random Forest Variable Importance Measures: Illustrations, Sources, and a Solution. BMC Bioinformatics, 8(25), doi:10.1186/1471-2105-8-25.
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T. and Zeileis, A. (2008) Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9(307), doi:10.1186/1471-2105-9-307.
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002) Diagnosis of multiple cancer types by nearest shrunken centroids of gene expressions. Proc Natl Acad Sci USA, 99, 6567-6572.
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2003) Class Prediction by Nearest Shrunken Centroids, with Applications to DNA Microarrays. Statistical Science, 18, 104-117.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BATC-0007-0082